# Preconditions

To get started, clone the course files from the repository.

In [None]:
!git clone https://github.com/ISierov/pandas_numpy_guide.git

from pandas_numpy_guide.checker import check_answer

Cloning into 'pandas_numpy_guide'...
remote: Enumerating objects: 43, done.
remote: Counting objects: 100% (43/43), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 43 (delta 14), reused 19 (delta 5), pack-reused 0
Receiving objects: 100% (43/43), 40.09 KiB | 483.00 KiB/s, done.
Resolving deltas: 100% (14/14), done.


# Task Pandas working with data

You have been given a CSV file containing information about customer orders from an e-commerce website. You need to clean and process the data using pandas to answer the following questions:



1.   How many unique customers made orders?
2.   What was the total revenue from all orders?
3.   What was the average order value?

To solve this task, you can follow these steps:

Step 1: Import the required libraries
We will be using pandas for this task, so we need to import it first. You can use the following code:

In [None]:
import pandas as pd

Step 2: Load the data

Next, we need to load the CSV file into a pandas DataFrame. You can use the <code>read_csv()</code> function to do this. The path below points to the cloned repository under <code>/content</code>, the default working directory in Google Colab; adjust it if your files are located elsewhere.

In [None]:
data = pd.read_csv('/content/pandas_numpy_guide/orders.csv')

Step 3: Clean the data

Before we can analyze the data, we need to clean it. We can start by checking whether the DataFrame contains any missing values. You can use the <code>isnull()</code> method to do this; for example, <code>data.isnull().sum()</code> counts the missing values in each column. If there are any missing values, you can either drop the affected rows or fill them with appropriate values. For this task, we will drop the rows with missing values.

In [None]:
data.dropna(inplace=True)
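
The check described above can be sketched on a small, made-up DataFrame. The `toy` frame and its values are assumptions for illustration only; the real `orders.csv` may contain different columns and data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from orders.csv
toy = pd.DataFrame({
    'CustomerID': [1, 2, np.nan, 4],
    'Quantity': [3, 1, 2, 5],
    'UnitPrice': [2.5, np.nan, 4.0, 1.0],
})

# Count the missing values in each column before deciding how to handle them
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
cleaned = toy.dropna()
print('Rows before:', len(toy), 'rows after:', len(cleaned))
```

Inspecting the counts first shows how much data `dropna()` is about to discard, which helps when deciding between dropping and filling.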

Step 4: Analyze the data

Now that we have cleaned the data, we can start analyzing it. To answer the first question, we need to count the number of unique customers in the DataFrame. We can use the <code>nunique()</code> method for this.

In [None]:
unique_customers = data['CustomerID'].nunique()
print('Number of unique customers:', unique_customers)

To answer the second question, we need to calculate the total revenue from all orders. We can do this by multiplying the Quantity and UnitPrice columns and then summing up the values.

In [None]:
data['Total'] = data['Quantity'] * data['UnitPrice']
total_revenue = data['Total'].sum()
print('Total revenue:', total_revenue)

To answer the third question, we need to calculate the average order value. We can do this by dividing the total revenue by the number of orders, where each unique *InvoiceNo* counts as one order.

In [None]:
num_orders = data['InvoiceNo'].nunique()
avg_order_value = total_revenue / num_orders
print('Average order value:', avg_order_value)
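
To see why unique invoices are counted rather than rows, here is a minimal sketch on made-up data. The `toy` frame is hypothetical; it only reuses the column names from the cells above.

```python
import pandas as pd

# Hypothetical toy orders: invoice A1 has two line items
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'B2', 'C3'],
    'CustomerID': [10, 10, 20, 10],
    'Quantity': [2, 1, 4, 3],
    'UnitPrice': [5.0, 10.0, 2.5, 4.0],
})

# Revenue of each line item, then of the whole data set
toy['Total'] = toy['Quantity'] * toy['UnitPrice']
toy_revenue = toy['Total'].sum()

# Four rows, but only three distinct orders
toy_orders = toy['InvoiceNo'].nunique()
toy_avg = toy_revenue / toy_orders

# Equivalent route: total per invoice first, then the mean of those totals
per_invoice = toy.groupby('InvoiceNo')['Total'].sum()

print('Total revenue:', toy_revenue)
print('Orders:', toy_orders)
print('Average order value:', toy_avg)
print('Mean of per-invoice totals:', per_invoice.mean())
```

Dividing by the number of rows instead would give the average line-item value, which is a different quantity.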

Step 5: Export the results

Finally, we can export the results to a CSV file. You can use the <code>to_csv()</code> function for this.

In [None]:
results = pd.DataFrame({'Number of unique customers': [unique_customers], 'Total revenue': [total_revenue], 'Average order value': [avg_order_value]})
results.to_csv('results1.csv', index=False)
print(results)

# Task Pandas working with data advanced

You have been given a dataset containing information about the sales of a retail store. Your task is to clean and process the data using pandas to answer the following questions:

1. Which product category had the highest sales revenue?
2. What was the average rating for products in each category?
3. What was the total profit for each region?


To solve this task, you can follow these steps:

Step 1: Import the required library

We already imported pandas earlier, so this step is optional.


In [None]:
import pandas as pd

Step 2: Load the data

In [None]:
data = pd.read_csv('/content/pandas_numpy_guide/sales.csv')

Step 3: Clean the data

In [None]:
data.dropna(inplace=True)

Step 4: Analyze the data

Now that we have cleaned the data, we can start analyzing it. To answer the first question, we need to calculate the sales revenue for each product category and then find the category with the highest revenue. We can do this by multiplying the *Quantity* and *UnitPrice* columns to get the sales revenue, and then grouping the data by *Category*.

In [None]:
data['SalesRevenue'] = data['Quantity'] * data['UnitPrice']
sales_by_category = data.groupby('Category')['SalesRevenue'].sum()
highest_revenue_category = sales_by_category.idxmax()
print('Category with highest sales revenue:', highest_revenue_category)
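
The group-then-pick-the-maximum pattern can be checked by hand on a tiny example. The categories and numbers below are invented for illustration and are not taken from `sales.csv`.

```python
import pandas as pd

# Hypothetical toy sales data
toy = pd.DataFrame({
    'Category': ['Toys', 'Books', 'Toys', 'Books', 'Games'],
    'Quantity': [1, 2, 3, 1, 2],
    'UnitPrice': [10.0, 5.0, 10.0, 5.0, 12.0],
})

# Revenue per row, then summed within each category
toy['SalesRevenue'] = toy['Quantity'] * toy['UnitPrice']
revenue = toy.groupby('Category')['SalesRevenue'].sum()

# idxmax() returns the index label (here, the category) of the largest value
top = revenue.idxmax()

# Sorting shows the full ranking, not just the winner
ranked = revenue.sort_values(ascending=False)

print(ranked)
print('Top category:', top)
```

Printing the sorted series alongside `idxmax()` makes it easy to see how close the runner-up categories are.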

To answer the second question, we need to calculate the average rating for products in each category. We can do this by grouping the data by *Category* and then calculating the mean of the *Rating* column.

In [None]:
avg_rating_by_category = data.groupby('Category')['Rating'].mean()
print('Average rating by category:\n', avg_rating_by_category)

To answer the third question, we need to calculate the total profit for each region. We can do this by multiplying the *Quantity* and *UnitCost* columns to get the total cost, and then subtracting the total cost from the sales revenue. We can then group the data by *Region* and calculate the sum of the profit.

In [None]:
data['TotalCost'] = data['Quantity'] * data['UnitCost']
data['Profit'] = data['SalesRevenue'] - data['TotalCost']
profit_by_region = data.groupby('Region')['Profit'].sum()
print('Total profit by region:\n', profit_by_region)
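
As a sanity check, profit can also be computed from the per-unit margin, since Quantity * UnitPrice - Quantity * UnitCost equals Quantity * (UnitPrice - UnitCost). The sketch below uses invented numbers and a hypothetical `toy` frame to confirm that both routes agree.

```python
import pandas as pd

# Hypothetical toy data; the last row is sold at a loss
toy = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South'],
    'Quantity': [2, 1, 3, 4],
    'UnitPrice': [10.0, 20.0, 5.0, 2.5],
    'UnitCost': [6.0, 15.0, 4.0, 3.0],
})

# Route 1: revenue minus total cost, as in the steps above
toy['SalesRevenue'] = toy['Quantity'] * toy['UnitPrice']
toy['TotalCost'] = toy['Quantity'] * toy['UnitCost']
toy['Profit'] = toy['SalesRevenue'] - toy['TotalCost']

# Route 2: quantity times per-unit margin
toy['ProfitFromMargin'] = toy['Quantity'] * (toy['UnitPrice'] - toy['UnitCost'])

profit = toy.groupby('Region')['Profit'].sum()
profit_check = toy.groupby('Region')['ProfitFromMargin'].sum()

print(profit)
print('Both routes agree:', profit.equals(profit_check))
```

Negative row-level profits are kept in the sum, so a region's total reflects losses as well as gains.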

Step 5: Export the results

In [None]:
results = pd.DataFrame({'Category with highest sales revenue': [highest_revenue_category]})
results.to_csv('results2.csv', index=False)
print(results)

# Task NumPy arrays



You're already familiar with Python lists, but NumPy provides its own array type with more advanced functionality. Some operations behave the same way as with lists, while others differ significantly.

Creating arrays: You can create arrays of different sizes and dimensions using numpy's *array* function.

In [None]:
import numpy as np

# Create a 1-dimensional array
a = np.array([1, 2, 3, 4, 5])
print(a)

# Create a 2-dimensional array
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(b)

Array indexing and slicing: You can index and slice arrays to access specific elements or subsets of an array.

In [None]:
import numpy as np

# Create a 1-dimensional array
a = np.array([1, 2, 3, 4, 5])

# Access the first element of the array
print(a[0])
# Access a slice of the array
print(a[1:4])
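
Indexing extends naturally to two dimensions. The following sketch, using a made-up matrix `m`, shows a few common forms: a single element, a row, a column, a sub-block, and a boolean mask.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Single element: row 1, column 2
element = m[1, 2]

# Entire first row and entire second column
first_row = m[0]
second_column = m[:, 1]

# Sub-block: first two rows, last two columns
block = m[:2, 1:]

# Boolean mask: all elements greater than 5, returned as a 1-dimensional array
large = m[m > 5]

print(element)
print(first_row)
print(second_column)
print(block)
print(large)
```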

Array operations: You can perform mathematical operations on arrays, such as addition, subtraction, multiplication, division, and more.



In [None]:
import numpy as np

# Plain Python lists
a = [1, 2, 3, 4, 5]
b = [5, 4, 3, 2, 1]

# For lists, + concatenates instead of adding element-wise
c = a + b
print('Python lists concatenated:', c)

# Arrays in NumPy
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])

# Add the two arrays
c = a + b
print('Numpy arrays sum:', c)

# Multiply the two arrays
d = a * b
print('Numpy arrays multiply:', d)
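
The difference between lists and NumPy arrays is easiest to see side by side. This is a small sketch with made-up values; the variable names are chosen only for illustration.

```python
import numpy as np

x = [1, 2, 3]
y = [10, 20, 30]

# Lists: + concatenates and * repeats
concatenated = x + y
repeated = x * 2

# Element-wise addition of lists needs an explicit loop or comprehension
list_sum = [p + q for p, q in zip(x, y)]

# NumPy arrays: the same operators work element by element
array_sum = np.array(x) + np.array(y)
scaled = np.array(x) * 2

print(concatenated)
print(repeated)
print(list_sum)
print(array_sum)
print(scaled)
```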

Array broadcasting: You can use broadcasting to perform operations on arrays of different shapes and sizes.

In [None]:
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([10, 20])

# Turn b into a column of shape (2, 1) so that each row of a is scaled by one element of b
c = a * b[:, np.newaxis]
print(c)
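
The role of `np.newaxis` becomes clearer when the result is compared with plain `a * b`. In the sketch below, the names `by_columns` and `by_rows` are illustrative only.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([10, 20])

# b has shape (2,): it is stretched across the rows,
# so column 0 is multiplied by 10 and column 1 by 20
by_columns = a * b

# b[:, np.newaxis] has shape (2, 1): it is stretched across the columns,
# so row 0 is multiplied by 10 and row 1 by 20
by_rows = a * b[:, np.newaxis]

print(b.shape, b[:, np.newaxis].shape)
print(by_columns)
print(by_rows)
```

Checking `.shape` before combining arrays is a quick way to predict which axis broadcasting will stretch.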


Array reshaping: You can reshape arrays to change their dimensions and sizes.

In [None]:
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# Reshape the array to a 2-dimensional array
b = a.reshape(2, 4)
print(b)
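
A few related reshaping idioms, sketched under the assumption of the same eight-element array: `-1` lets NumPy infer one dimension, `ravel()` flattens the array again, and a shape that does not match the number of elements raises a `ValueError`.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# -1 asks NumPy to work out the missing dimension: 8 elements / 4 rows = 2 columns
tall = a.reshape(4, -1)

# Three dimensions are possible too, as long as 2 * 2 * 2 == 8
cube = a.reshape(2, 2, 2)

# ravel() flattens back to one dimension
flat = tall.ravel()

# A shape with the wrong number of elements is rejected
try:
    a.reshape(3, 3)
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(tall.shape, cube.shape, flat.shape, reshape_failed)
```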

# Task NumPy basic statistics

Write a program that takes a list of numbers as input and computes the mean, median, and standard deviation using the numpy library.

Step 1: Import numpy

In [None]:
import numpy as np

Step 2: Create a list of numbers

In [None]:
numbers = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

Step 3: Calculate the mean

To calculate the *mean* of the list of numbers, you can use the mean function from the numpy library. The *mean* function takes an array-like object as its argument and returns the mean of the elements in the array.

In [None]:
mean = np.mean(numbers)
print("Mean:", mean)

Step 4: Calculate the median

To calculate the median of the list of numbers, you can use the *median* function from the numpy library. The *median* function takes an array-like object as its argument and returns the median of the elements in the array.

In [None]:
median = np.median(numbers)
print("Median:", median)

Step 5: Calculate the standard deviation

To calculate the standard deviation of the list of numbers, you can use the *std* function from the numpy library. The *std* function takes an array-like object as its argument and returns the standard deviation of the elements in the array. By default it computes the population standard deviation (the sum of squared deviations is divided by N).

In [None]:
std_dev = np.std(numbers)
print("Standard Deviation:", std_dev)
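
Note that `np.std` divides by N unless told otherwise. If the numbers are a sample from a larger population, the sample standard deviation (dividing by N - 1) is often wanted instead; the `ddof` parameter controls this. A minimal comparison on the same list:

```python
import numpy as np

numbers = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

# Population standard deviation: divide by N (ddof=0 is the default)
population_std = np.std(numbers)

# Sample standard deviation: divide by N - 1
sample_std = np.std(numbers, ddof=1)

print('Population std:', population_std)
print('Sample std:', sample_std)
```

The sample value is always slightly larger, and the gap shrinks as the list grows.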

# Task NumPy Gaussian

A classic problem from the school curriculum about solving a system of equations:

- x + 2y = 3
- 2x - 6y = 12

At school, you would solve it either by adding the equations to eliminate one unknown, or by expressing one unknown in terms of the other. With many unknowns, however, this approach becomes unwieldy. That is why we look at a method designed for systems with a large number of unknowns: Gaussian elimination.

Write a program that solves a system of linear equations using the Gaussian elimination method with the numpy library.

Step 1: Import numpy library

We already imported numpy earlier, so this step is optional.

In [None]:
import numpy as np

Step 2: Define the system of linear equations

In this task, we will define a system of linear equations to demonstrate Gaussian elimination with numpy. You can represent the system as two numpy arrays: a matrix of coefficients *A* and a column vector of constants *B*.

In [None]:
A = np.array([[2, 3, -1], [4, 4, -3], [2, -3, 1]])
B = np.array([[7], [5], [-1]])

This will define the system of linear equations as follows:
- 2x + 3y - z = 7
- 4x + 4y - 3z = 5
- 2x - 3y + z = -1

Step 3: Solve the system of linear equations using Gaussian elimination

To solve the system of linear equations, you can use the *linalg.solve* function from the numpy library. It takes a matrix of coefficients and a matrix of constants as its arguments and returns the solution to the system. Internally it relies on a LAPACK routine based on LU decomposition with partial pivoting, which is a form of Gaussian elimination.

In [None]:
x = np.linalg.solve(A, B)

Step 4: Print the result

In [None]:
print(x)

[[1.5]
 [2.6]
 [3.8]]
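
For readers who want to see the elimination steps themselves, here is a teaching sketch of Gaussian elimination with partial pivoting. The function `gaussian_elimination` is a hypothetical helper written for this illustration; it is not part of numpy and not what `np.linalg.solve` literally runs. It is applied to the two-equation system from the introduction.

```python
import numpy as np

def gaussian_elimination(A, b):
    """Solve A x = b for a square, non-singular A (illustrative helper)."""
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float).reshape(-1)
    n = len(b)

    # Forward elimination: zero out the entries below each pivot
    for col in range(n):
        # Partial pivoting: use the row with the largest absolute value in this column
        pivot = col + np.argmax(np.abs(A[col:, col]))
        if np.isclose(A[pivot, col], 0.0):
            raise ValueError('Matrix is singular or nearly singular')
        if pivot != col:
            A[[col, pivot]] = A[[pivot, col]]
            b[[col, pivot]] = b[[pivot, col]]
        for row in range(col + 1, n):
            factor = A[row, col] / A[col, col]
            A[row, col:] -= factor * A[col, col:]
            b[row] -= factor * b[col]

    # Back substitution: solve from the last equation upwards
    x = np.zeros(n)
    for row in range(n - 1, -1, -1):
        x[row] = (b[row] - A[row, row + 1:] @ x[row + 1:]) / A[row, row]
    return x

# The system from the introduction: x + 2y = 3 and 2x - 6y = 12
solution = gaussian_elimination([[1, 2], [2, -6]], [3, 12])
print(solution)

# Cross-check against numpy's solver
reference = np.linalg.solve(np.array([[1.0, 2.0], [2.0, -6.0]]), np.array([3.0, 12.0]))
print(np.allclose(solution, reference))
```

In practice `np.linalg.solve` is the better choice; the sketch is only meant to make the row operations visible.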


# Your turn!

You already have a ***data.csv*** file with the following fields:

- IP
- Number of bytes
- Country
- Username

> ***Don't change the record or field names in the file!***

Your task is to find

- the total number of bytes used

- the average number of bytes per request

- the most popular country (by the number of requests)

- the user in 3rd place by number of bytes

- the number of bytes used by users from Ukraine

- the number of unique users

- the difference in the average number of bytes per request between users from Ukraine and the UK (rounded to the nearest whole number)

- the average number of bytes per IP address

- the total number of users from Europe (UK, France, Germany, Poland and Ukraine)

In [None]:
import pandas as pd

# Read the CSV file
df = pd.read_csv('pandas_numpy_guide/data.csv')

# Write your code here

In [None]:
# Check your answers
check_answer()

Incorrect
