### Problem Overview:

You are given a list of dictionaries. Each dictionary represents a sales transaction and contains three keys: 'id', 'product', and 'price'. 'id' represents the id of a transaction, 'product' represents the name of the product, and 'price' is the price of the product.

Your task is to write a function that filters out transactions with a price greater than a given threshold and returns a new list of dictionaries with the remaining transactions.

The function should return a new list of dictionaries. Each dictionary should contain the 'id', 'product', and 'price' for each transaction.

### Libraries Needed:

None

### Inputs:

The function will take the following inputs:

1. A list of dictionaries 'transactions'. Each dictionary contains three keys: 'id', 'product', and 'price'. 'id' is a string that represents the id of a transaction, 'product' is a string that represents the name of the product, and 'price' is a float that represents the price of the product.
2. A float 'threshold'. This represents the maximum price for a transaction to be included in the output.


### Expected Outputs:

The function should return a list of dictionaries. Each dictionary should contain three keys: 'id', 'product', and 'price'. The list should only include transactions where the 'price' is less than or equal to 'threshold'.

### Data:

In [1]:
transactions = [
    {"id": "t1", "product": "apple", "price": 1.0},
    {"id": "t2", "product": "banana", "price": 0.5},
    {"id": "t3", "product": "cherry", "price": 2.0},
]

In [14]:
def filter_under_price(transactions, thresh_price):
    filtered_transactions = list(filter(lambda x: x["price"] <= thresh_price, transactions))
    return filtered_transactions

In [15]:
filter_under_price(transactions, 1)

[{'id': 't1', 'product': 'apple', 'price': 1.0},
 {'id': 't2', 'product': 'banana', 'price': 0.5}]
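
Note that `filter` keeps references to the original dictionaries, so mutating a returned transaction also changes the input list. If independent output is needed, one option is a list comprehension that copies each dictionary. The sketch below assumes the same data as above; the name `filter_under_price_copy` is hypothetical.

```python
def filter_under_price_copy(transactions, threshold):
    """Return copies of the transactions priced at or below threshold."""
    # dict(t) makes a shallow copy, which is enough here because all values are scalars
    return [dict(t) for t in transactions if t["price"] <= threshold]


transactions = [
    {"id": "t1", "product": "apple", "price": 1.0},
    {"id": "t2", "product": "banana", "price": 0.5},
    {"id": "t3", "product": "cherry", "price": 2.0},
]
kept = filter_under_price_copy(transactions, 1)
```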

#### If this were SQL

- Assuming table = transactions and columns = id, product, and price:

```sql
SELECT *
FROM transactions
WHERE price <= 1
```

**Problem Overview:**

You are given a list of dictionaries. Each dictionary represents a customer and contains three keys: 'id', 'age', and 'purchases'. 'id' represents the id of a customer, 'age' represents the age of the customer, and 'purchases' is a list of prices for each purchase made by the customer.

Your task is to write a function that calculates the average purchase price for each customer who is above a given age and returns a new list of dictionaries with the calculated averages.

The function should return a new list of dictionaries. Each dictionary should contain the 'id' and 'average_purchase' for each customer.

**Libraries Needed:**

```python
```

**Inputs:**

The function will take the following inputs:

1. A list of dictionaries 'customers'. Each dictionary contains three keys: 'id', 'age', and 'purchases'. 'id' is a string that represents the id of a customer, 'age' is an integer that represents the age of the customer, and 'purchases' is a list of floats that represent the prices of the purchases.
2. An integer 'age_limit'. Only customers strictly older than this age are included in the output.

**Expected Outputs:**

The function should return a list of dictionaries. Each dictionary should contain two keys: 'id' and 'average_purchase'. The list should only include customers whose age is greater than 'age_limit'. 'average_purchase' should be the average purchase price rounded to 2 decimal places.

**Data:**

```python
customers = [
    {"id": "c1", "age": 25, "purchases": [10.0, 20.0, 30.0]},
    {"id": "c2", "age": 30, "purchases": [15.0, 25.0, 35.0]},
    {"id": "c3", "age": 35, "purchases": [20.0, 30.0, 40.0]},
]
```

**Encrypted Solution:**

Here's the solution, encrypted with a Caesar cipher with a shift of 3 to the right:

```python
ghilqk_fdofoxodwh_dyhudjhfxvw_rphuvfxvw_rphuv, djh_olplw:
    ghilqk_fdofoxodwh_dyhudjh_sxufkdvhfxvw_rphu:
        dyhudjh_sxufkdvh = urxqg{vxp{fxvw_rphu{'sxufkdvhv'}}/ohq{fxvw_rphu{'sxufkdvhv'}}, 2)
        uhwxuq {'lg': fxvw_rphu{'lg'}, 'dyhudjh_sxufkdvh': dyhudjh_sxufkdvh}
    
    uhvxowv = olvw{pdskh{fdofoxodwh_dyhudjh_sxufkdvh, ilowhu{odpegd{fxvw_rphu: fxvw_rphu{'djh'} juhdwhu wkdq djh_olplw, fxvw_rphuv}})}
    uhwxuq uhvxowv
```

The shift of 3 letters to the right was applied to every letter of the solution, but not to special characters, digits or whitespaces.
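
As a rough illustration of the cipher described above, the sketch below shifts letters by a given amount and leaves digits, punctuation, and whitespace untouched; decoding is the same operation with a negative shift. The helper name `caesar_shift` is an assumption, and the encrypted block above may not decode to runnable code as-is.

```python
import string


def caesar_shift(text, shift):
    """Shift ASCII letters by `shift` positions, leaving other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            # Digits, punctuation, and whitespace pass through unchanged
            out.append(ch)
    return "".join(out)


encoded = caesar_shift("def average(x):", 3)   # shift 3 to the right
decoded = caesar_shift(encoded, -3)            # shift back to decode
```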

In [2]:
import pandas as pd

In [3]:
customers = [
    {"id": "c1", "age": 25, "purchases": [10.0, 20.0, 30.0]},
    {"id": "c2", "age": 30, "purchases": [15.0, 25.0, 35.0]},
    {"id": "c3", "age": 35, "purchases": [20.0, 30.0, 40.0]},
]

In [20]:
def mean_of_list(list_of_float):
    return sum(list_of_float) / len(list_of_float)

In [35]:
def average_price_per_aged_customer(customer_list, minimum_age):
    # Keep only customers strictly older than minimum_age, as the problem states,
    # and round the average to 2 decimal places
    return [{"id": customer['id'],
             'average_purchase': round(mean_of_list(customer['purchases']), 2)
             } for customer in customer_list if customer['age'] > minimum_age]

print(average_price_per_aged_customer(customers, 0))
print(average_price_per_aged_customer(customers, 30))
print(average_price_per_aged_customer(customers, 35))
print(average_price_per_aged_customer(customers, 40))

[{'id': 'c1', 'average_purchase': 20.0}, {'id': 'c2', 'average_purchase': 25.0}, {'id': 'c3', 'average_purchase': 30.0}]
[{'id': 'c3', 'average_purchase': 30.0}]
[]
[]


**If this were SQL**

```sql
-- Assumes one row per purchase, with columns id, age, and purchase
SELECT id, ROUND(AVG(purchase), 2) AS average_purchase
FROM customers
WHERE age > 25
GROUP BY id;
```

Problem Overview:

You are given a dataset of products with their names, categories, and prices. However, some of the product names contain extra spaces and some prices are missing (represented as None). Your task is to write a function that will clean the dataset by:

1. Removing leading and trailing whitespace from product names.
2. Replacing any missing prices (None) with the average price of products in the same category.

The function should return a cleaned list of dictionaries with the updated product names and prices.

Inputs:

The function will take the following input:

A list of dictionaries 'products'. Each dictionary contains three keys: 'name', 'category', and 'price'. 'name' is a string that represents the name of the product, 'category' is a string that represents the category of the product, and 'price' is a float that represents the price of the product (or None if the price is missing).

Expected Outputs:

The function should return a list of dictionaries. Each dictionary should contain three keys: 'name', 'category', and 'price'. The 'name' should have leading and trailing whitespaces removed, and any 'price' that was None should be replaced with the average price of other products in the same category, rounded to 2 decimal places.

Data:

In [36]:
products = [
    {"name": " apple ", "category": "fruit", "price": 1.0},
    {"name": "banana", "category": "fruit", "price": None},
    {"name": "cherry  ", "category": "fruit", "price": 2.0},
    {"name": " lettuce", "category": "vegetable", "price": 1.5},
    {"name": "carrot", "category": "vegetable", "price": None},
]


In [37]:
def clean_products(products):
    # Accumulate [total, count] per category from known prices only,
    # so missing prices do not drag the average down
    categories_average = {}
    for product in products:
        price = product['price']
        if price is None:
            continue
        key_average = categories_average.setdefault(product['category'], [0, 0])
        key_average[0] += price
        key_average[1] += 1

    for key in categories_average:
        total, count = categories_average[key]
        categories_average[key] = total / count

    for product in products:
        product['name'] = product['name'].strip()
        if product['price'] is None:
            # Stays None when the category has no known price at all
            average = categories_average.get(product['category'])
            if average is not None:
                product['price'] = round(average, 2)

    return products


In [38]:

clean_products(products)

[{'name': 'apple', 'category': 'fruit', 'price': 1.0},
 {'name': 'banana', 'category': 'fruit', 'price': 1.5},
 {'name': 'cherry', 'category': 'fruit', 'price': 2.0},
 {'name': 'lettuce', 'category': 'vegetable', 'price': 1.5},
 {'name': 'carrot', 'category': 'vegetable', 'price': 1.5}]

Objective:
Background:

You are a Data Scientist at a retail company and you have been given a dataset containing transaction data. This data includes the transaction date, product ID, quantity, and price. Unfortunately, some of the date values are missing and are marked as None. You need to impute these missing dates using the median date of the entire dataset.

Question:

Write a function that takes the transaction data and imputes the missing dates with the median date. Make sure to convert the dates from string to Python datetime objects and then back to string in the 'YYYY-MM-DD' format after imputation.

Libraries Needed:

from datetime import datetime

Inputs:

A list of dictionaries named transactions. Each dictionary contains:

- 'date': a string representing the date of the transaction in 'YYYY-MM-DD' format or None if the date is missing.
- 'product_id': a string representing the product ID.
- 'quantity': an integer representing the quantity of the product.
- 'price': a float representing the price of the product.

Expected Outputs:

The function should return a list of dictionaries with the same structure as the input, but with all missing dates imputed with the median date of the dataset.

Data:

transactions = [
    {"date": "2022-07-01", "product_id": "P1", "quantity": 10, "price": 5.5},
    {"date": None, "product_id": "P2", "quantity": 8, "price": 6.0},
    {"date": "2022-07-03", "product_id": "P3", "quantity": 15, "price": 7.0},
    {"date": "2022-07-02", "product_id": "P4", "quantity": 12, "price": 6.5},
    {"date": None, "product_id": "P5", "quantity": 9, "price": 5.0},
]
Solution:

In [5]:
import pandas as pd
import numpy as np

# Read in the data
transactions = [
    {"date": "2022-07-01", "product_id": "P1", "quantity": 10, "price": 5.5},
    {"date": None, "product_id": "P2", "quantity": 8, "price": 6.0},
    {"date": "2022-07-03", "product_id": "P3", "quantity": 15, "price": 7.0},
    {"date": "2022-07-02", "product_id": "P4", "quantity": 12, "price": 6.5},
    {"date": None, "product_id": "P5", "quantity": 9, "price": 5.0},
]
df = pd.DataFrame(transactions)

In [7]:
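def impute_missing_dates(df):
    # Parse the date strings; None becomes NaT
    df['date'] = pd.to_datetime(df['date'])
    median_date = df['date'].median()
    df['date'] = df['date'].fillna(median_date)
    # Convert back to 'YYYY-MM-DD' strings, as the question requires
    df['date'] = df['date'].dt.strftime('%Y-%m-%d')
    return df

impute_missing_dates(df)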

Unnamed: 0,date,product_id,quantity,price
0,2022-07-01,P1,10,5.5
1,2022-07-02,P2,8,6.0
2,2022-07-03,P3,15,7.0
3,2022-07-02,P4,12,6.5
4,2022-07-02,P5,9,5.0
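
The question asks for a list of dictionaries with dates converted through Python datetime objects, while the solution above works on a DataFrame. A minimal standard-library sketch of the same idea follows; the function name `impute_missing_dates_stdlib` and the handling of an even number of known dates (midpoint of the two middle dates) are assumptions, not part of the original task.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"


def impute_missing_dates_stdlib(transactions):
    """Return new dicts with missing dates replaced by the median known date."""
    known = sorted(
        datetime.strptime(t["date"], DATE_FORMAT)
        for t in transactions
        if t["date"] is not None
    )
    if not known:
        # Nothing to compute a median from; return copies unchanged
        return [dict(t) for t in transactions]

    mid = len(known) // 2
    if len(known) % 2:
        median = known[mid]
    else:
        # Even count: midpoint of the two middle dates
        # (any time-of-day part is dropped when formatting)
        median = known[mid - 1] + (known[mid] - known[mid - 1]) / 2
    median_str = median.strftime(DATE_FORMAT)

    return [
        {**t, "date": t["date"] if t["date"] is not None else median_str}
        for t in transactions
    ]


transactions = [
    {"date": "2022-07-01", "product_id": "P1", "quantity": 10, "price": 5.5},
    {"date": None, "product_id": "P2", "quantity": 8, "price": 6.0},
    {"date": "2022-07-03", "product_id": "P3", "quantity": 15, "price": 7.0},
    {"date": "2022-07-02", "product_id": "P4", "quantity": 12, "price": 6.5},
    {"date": None, "product_id": "P5", "quantity": 9, "price": 5.0},
]
imputed = impute_missing_dates_stdlib(transactions)
```

Unlike the DataFrame version, this sketch leaves the input list untouched and returns new dictionaries.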


### Problem Overview:

You are working as a Data Scientist for a retail company. The company stores data about customer purchases in a DataFrame. Each row in the DataFrame represents a unique purchase and contains the customer's ID, the date of the purchase, the purchased item's ID, and the item's price.

However, the data is not clean. The item prices are stored as strings with a dollar sign, and some of the purchase dates are missing. Your task is to clean the DataFrame by converting the item prices to floats and imputing the missing dates with the median date.

### Libraries Needed:


import pandas as pd
import numpy as np
### Inputs:

A pandas DataFrame named df. The DataFrame has the following columns:

- 'customer_id': a string representing the customer's ID.
- 'date': a string representing the date of the purchase in the 'YYYY-MM-DD' format or None if the date is missing.
- 'item_id': a string representing the purchased item's ID.
- 'price': a string representing the price of the item with a dollar sign.

### Expected Outputs:

The function should return a cleaned DataFrame with the same structure as the input, but with the item prices converted to floats and all missing dates imputed with the median date.

### Data:

Let's create a large DataFrame with 1,000 rows. For simplicity, we can randomly generate the data:

In [17]:
import pandas as pd
import numpy as np

np.random.seed(0)

# Generate 1,000 rows of data
n_rows = 1000
customer_ids = [f"C{i}" for i in np.random.randint(1, 100, n_rows)]
dates = pd.date_range(start='2022-01-01', end='2022-12-31').to_list()
dates = np.random.choice(dates + [None] * len(dates), n_rows).tolist()
item_ids = [f"I{i}" for i in np.random.randint(1, 50, n_rows)]
prices = [f"${i:.2f}" for i in np.random.uniform(1, 100, n_rows)]

# Create DataFrame
df = pd.DataFrame({
    'customer_id': customer_ids,
    'date': dates,
    'item_id': item_ids,
    'price': prices,
})


In [18]:
df

Unnamed: 0,customer_id,date,item_id,price
0,C45,NaT,I29,$64.74
1,C48,2022-09-19,I15,$50.68
2,C65,2022-02-08,I15,$81.34
3,C68,NaT,I17,$48.13
4,C68,2022-03-30,I36,$52.79
...,...,...,...,...
995,C6,2022-12-04,I44,$88.09
996,C39,2022-10-27,I26,$29.21
997,C39,2022-11-20,I21,$94.23
998,C66,NaT,I48,$55.07


In [19]:
def clean_data(dataframe):
    # Parse dates (a no-op if the column is already datetime),
    # then fill the missing values with the median date
    dataframe['date'] = pd.to_datetime(dataframe['date'])
    median_date = dataframe['date'].median()
    dataframe['date'] = dataframe['date'].fillna(median_date)

    # Strip the literal '$' and any thousands separators, then convert to float
    dataframe['price'] = (dataframe['price']
                          .str.replace('$', '', regex=False)
                          .str.replace(',', '', regex=False)
                          .astype(float))
    return dataframe


In [20]:
clean_data(df)

Unnamed: 0,customer_id,date,item_id,price
0,C45,2022-07-01,I29,64.74
1,C48,2022-09-19,I15,50.68
2,C65,2022-02-08,I15,81.34
3,C68,2022-07-01,I17,48.13
4,C68,2022-03-30,I36,52.79
...,...,...,...,...
995,C6,2022-12-04,I44,88.09
996,C39,2022-10-27,I26,29.21
997,C39,2022-11-20,I21,94.23
998,C66,2022-07-01,I48,55.07
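
One detail worth checking when stripping the dollar sign: `$` is also the end-of-string anchor in regular expressions, and whether pandas `str.replace` treats the pattern as a regex depends on the pandas version and the `regex` argument. The standard-library sketch below only illustrates the difference between the two interpretations; passing `regex=False` in pandas requests the literal one.

```python
import re

price = "$64.74"

# As a regular expression, a bare "$" matches the end of the string,
# so substituting it removes nothing.
anchored = re.sub("$", "", price)

# Escaping it, or using plain string replacement, removes the literal sign.
escaped = re.sub(r"\$", "", price)
literal = price.replace("$", "")

# Drop thousands separators (if any) before converting to float
as_float = float(literal.replace(",", ""))
```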


### Objective:

### Background:
You are a data scientist at a retail company, and you are tasked with analyzing the sales data of various products over different months. You need to identify the products that show a significant upward trend in sales.

### Question:
Write a Python function identify_trending_products(sales_data: pd.DataFrame) -> List[str] that takes a pandas DataFrame containing sales data and returns a list of product names that show a statistically significant upward trend in sales over months.

The input DataFrame sales_data has the following columns:

- 'product_name': (str) the name of the product
- 'month': (int) the month of the sale (from 1 to 12)
- 'sales': (float) the sales value for the product in that month

Return a list of product names that show a significant upward trend in sales over months. A product is considered to have a significant upward trend if the p-value of its linear regression slope is less than 0.05.

### Inputs:
- sales_data: a pandas DataFrame with columns 'product_name' (str), 'month' (int), and 'sales' (float).

### Outputs:
- A list of strings containing the names of products that show a statistically significant upward trend in sales.
### Libraries Needed:
import pandas as pd
from scipy.stats import linregress
### Data:

In [2]:
import pandas as pd
data = pd.DataFrame({
  "product_name": ["Widget A", "Widget A", "Widget A", "Widget B", "Widget B", "Widget B", "Widget C", "Widget C", "Widget C"],
  "month": [1, 2, 3, 1, 2, 3, 1, 2, 3],
  "sales": [100.0, 150.0, 200.0, 50.0, 45.0, 40.0, 5.0, 10.0, 15.0]
})

In [3]:
data

Unnamed: 0,product_name,month,sales
0,Widget A,1,100.0
1,Widget A,2,150.0
2,Widget A,3,200.0
3,Widget B,1,50.0
4,Widget B,2,45.0
5,Widget B,3,40.0
6,Widget C,1,5.0
7,Widget C,2,10.0
8,Widget C,3,15.0


In [23]:
from scipy.stats import linregress
import pandas as pd

def identify_trending_products(sales_data: pd.DataFrame) -> list:
    trending_products = []  # List to store names of trending products
    grouped_sales = sales_data.groupby('product_name')

    for product_name, product_data in grouped_sales:
        result = linregress(product_data['month'], product_data['sales'])
        if (result.pvalue < 0.05) and (result.slope > 0): #Product Growth with P Value
            trending_products.append(product_name)
            
    return trending_products


In [26]:
growing_products = identify_trending_products(data)
growing_products

['Widget A', 'Widget C']
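
To see why Widget B is excluded, it can help to look at the fitted slopes directly. The sketch below computes the ordinary least-squares slope by hand for the sample data; it does not compute a p-value, and the helper name `least_squares_slope` is hypothetical. Each sample series lies exactly on a line, which is presumably why the regression reports very small p-values here; with only three points per product, such a result is weak evidence of a real trend.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


months = [1, 2, 3]
sales = {
    "Widget A": [100.0, 150.0, 200.0],
    "Widget B": [50.0, 45.0, 40.0],
    "Widget C": [5.0, 10.0, 15.0],
}
slopes = {name: least_squares_slope(months, values) for name, values in sales.items()}
# Only products with a positive slope can qualify as trending upward
upward = [name for name, slope in slopes.items() if slope > 0]
```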

### Problem Overview:

You are working as a Data Scientist for a retail company. The company stores data about customer purchases in a DataFrame. Each row in the DataFrame represents a unique purchase and contains the customer's ID, the date of the purchase, the purchased item's ID, and the item's price.

However, the data is not clean. The item prices are stored as strings with a dollar sign, and some of the purchase dates are missing. Your task is to clean the DataFrame by converting the item prices to floats and imputing the missing dates with the median date.

### Libraries Needed:

```python

import pandas as pd
import numpy as np
```
### Inputs:

A pandas DataFrame named df. The DataFrame has the following columns:

- 'customer_id': a string representing the customer's ID.
- 'date': a string representing the date of the purchase in the 'YYYY-MM-DD' format or None if the date is missing.
- 'item_id': a string representing the purchased item's ID.
- 'price': a string representing the price of the item with a dollar sign.
### Expected Outputs:

The function should return a cleaned DataFrame with the same structure as the input, but with the item prices converted to floats and all missing dates imputed with the median date.

### Data:

Let's create a large DataFrame with 1,000 rows. For simplicity, we can randomly generate the data:

```python

import pandas as pd
import numpy as np

np.random.seed(0)

# Generate 1,000 rows of data
n_rows = 1000
customer_ids = [f"C{i}" for i in np.random.randint(1, 100, n_rows)]
dates = pd.date_range(start='2022-01-01', end='2022-12-31').to_list()
dates = np.random.choice(dates + [None] * len(dates), n_rows).tolist()
item_ids = [f"I{i}" for i in np.random.randint(1, 50, n_rows)]
prices = [f"${i:.2f}" for i in np.random.uniform(1, 100, n_rows)]

# Create DataFrame
df = pd.DataFrame({
    'customer_id': customer_ids,
    'date': dates,
    'item_id': item_ids,
    'price': prices,
})
```
### Encrypted Solution:

Here's the solution, encrypted with a Caesar cipher with a shift of 3 to the right:

```python

ghilqk_fohdq_gdwbdig:
    # Frqyhuw wkh sulpdwb iulfhv wkh iordwv dqg
    # uhpryh wkh groodu vljqv
    dig['sulpd'] = dig['sulpd'].vwulsb{1:}.dvwbsh{ioraw}
    
    # Frqyhqw wkh gdwhv wkh gdwhwbph rEMhfwv
    dig['gdwh'] = sg.gdwhwbph{dig['gdwh']}
    
    # Lpsxwh wkh plvvlqj gdwhv zlwk wkh phglbdq gdwh
    phglbdq_gdwh = dig['gdwh'].phglbdqbrs
    dig['gdwh'].iloobbr{phglbdq_gdwh, lqsoadh=Wrxd}
    
    # Frqyhqw wkh gdwhv edfn wkh vwulqjv
    dig['gdwh'] = dig['gdwh'].gvwbsh{\'BBBB-PP-GG\'}
    
    uhwxuq dig
```
The shift of 3 letters to the right was applied to every letter of the solution, but not to special characters, digits, or whitespaces.

In [1]:
import pandas as pd
import numpy as np

np.random.seed(0)

# Generate 1,000 rows of data
n_rows = 1000
customer_ids = [f"C{i}" for i in np.random.randint(1, 100, n_rows)]
dates = pd.date_range(start='2022-01-01', end='2022-12-31').to_list()
dates = np.random.choice(dates + [None] * len(dates), n_rows).tolist()
item_ids = [f"I{i}" for i in np.random.randint(1, 50, n_rows)]
prices = [f"${i:.2f}" for i in np.random.uniform(1, 100, n_rows)]

# Create DataFrame
df = pd.DataFrame({
    'customer_id': customer_ids,
    'date': dates,
    'item_id': item_ids,
    'price': prices,
})

In [2]:
df

Unnamed: 0,customer_id,date,item_id,price
0,C45,NaT,I29,$64.74
1,C48,2022-09-19,I15,$50.68
2,C65,2022-02-08,I15,$81.34
3,C68,NaT,I17,$48.13
4,C68,2022-03-30,I36,$52.79
...,...,...,...,...
995,C6,2022-12-04,I44,$88.09
996,C39,2022-10-27,I26,$29.21
997,C39,2022-11-20,I21,$94.23
998,C66,NaT,I48,$55.07


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   customer_id  1000 non-null   object        
 1   date         518 non-null    datetime64[ns]
 2   item_id      1000 non-null   object        
 3   price        1000 non-null   object        
dtypes: datetime64[ns](1), object(3)
memory usage: 31.4+ KB


In [8]:
def clean_the_data(df):
    median_date = df['date'].median()
    df['date'] = df['date'].fillna(median_date)

    #Remove the '$' symbol from each price
    df['price'] = df['price'].str.replace('$', '', regex=False).astype(float)
    return df

In [9]:
clean_data = clean_the_data(df)

In [11]:
clean_data.price.describe()

count    1000.00000
mean       51.00319
std        28.38245
min         1.01000
25%        27.87250
50%        50.56000
75%        75.77000
max       100.00000
Name: price, dtype: float64

### Objective:
You are tasked with analyzing customer feedback data to identify trends and sentiments. The dataset contains customer reviews in text format, along with corresponding ratings on a scale of 1 to 5. Your goal is to preprocess the text data, perform sentiment analysis, and generate a summary report.

### Libraries Needed:
You will need the following libraries:

In [9]:
import pandas as pd
import numpy as np
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer


### Data:
You are provided with a JSON file named "customer_feedback.json" that contains the following columns: "review_text" (textual customer reviews) and "rating" (integer ratings from 1 to 5).

### Inputs:
Columns: "review_text" (str), "rating" (int)

### Expected Outputs:
A summary report in the form of a DataFrame with the following columns: "Rating", "Positive Reviews", "Neutral Reviews", "Negative Reviews".
For each rating (1 to 5), count the number of reviews classified as positive, neutral, and negative based on sentiment analysis.

In [10]:
json_data = [
    {
        "review_text": "Great product, very satisfied with my purchase.",
        "rating": 5
    },
    {
        "review_text": "The quality is not up to my expectations.",
        "rating": 2
    },
  {
    "review_text": "Great product, very satisfied with my purchase.",
    "rating": 5
  },
  {
    "review_text": "The quality is not up to my expectations.",
    "rating": 2
  },
  {
    "review_text": "Fast delivery and excellent customer service.",
    "rating": 4
  },
  {
    "review_text": "I would not recommend this product to others.",
    "rating": 1
  },
  {
    "review_text": "Average performance, nothing exceptional.",
    "rating": 3
  },
  {
    "review_text": "Outstanding experience with the product and seller.",
    "rating": 5
  },
  {
    "review_text": "Terrible quality, fell apart after a few uses.",
    "rating": 1
  },
  {
    "review_text": "Good value for the price.",
    "rating": 4
  },
  {
    "review_text": "Not bad, but could be better.",
    "rating": 3
  },
  {
    "review_text": "Best purchase I've made in a while!",
    "rating": 5
  },
  {
    "review_text": "Extremely disappointed, regret buying it.",
    "rating": 1
  },
  {
    "review_text": "Satisfied with the overall performance.",
    "rating": 4
  },
  {
    "review_text": "Could use some improvements, but it's decent.",
    "rating": 3
  },
  {
    "review_text": "Absolutely fantastic, exceeded my expectations!",
    "rating": 5
  },
  {
    "review_text": "Avoid at all costs, waste of money.",
    "rating": 1
  },
  {
    "review_text": "Reasonably good quality for the price.",
    "rating": 4
  },
  {
    "review_text": "Meh, not impressed.",
    "rating": 2
  },
  {
    "review_text": "Highly recommended, top-notch product!",
    "rating": 5
  },
  {
    "review_text": "The worst thing I've ever purchased.",
    "rating": 1
  },
  {
    "review_text": "Decent experience, nothing extraordinary.",
    "rating": 3
  },
  {
    "review_text": "Impressed with the features and performance.",
    "rating": 4
  },
  {
    "review_text": "Absolutely awful, total waste of money.",
    "rating": 1
  },
  {
    "review_text": "Solid product, reliable and functional.",
    "rating": 4
  },
  {
    "review_text": "Could have been better, not too happy.",
    "rating": 2
  },
  {
    "review_text": "Exceeded my expectations, very satisfied!",
    "rating": 5
  },
  {
    "review_text": "Disappointed with the quality, broke easily.",
    "rating": 2
  },
  {
    "review_text": "Good purchase, serves its purpose well.",
    "rating": 4
  },
  {
    "review_text": "Waste of money, won't buy again.",
    "rating": 1
  },
  {
    "review_text": "Average quality, nothing special.",
    "rating": 3
  },
  {
    "review_text": "Highly impressed, great value for the price.",
    "rating": 5
  },
  {
    "review_text": "Awful product, regret buying it.",
    "rating": 1
  },
  {
    "review_text": "Satisfactory performance, met my needs.",
    "rating": 4
  },
  {
    "review_text": "Not very good, needs improvement.",
    "rating": 2
  },
  {
    "review_text": "Exceptional product, exceeded expectations.",
    "rating": 5
  },
  {
    "review_text": "Complete waste of money, do not recommend.",
    "rating": 1
  },
  {
    "review_text": "Decent value for the price paid.",
    "rating": 3
  },
  {
    "review_text": "Not impressed, would not buy again.",
    "rating": 2
  },
  {
    "review_text": "Absolutely amazing, worth every penny!",
    "rating": 5
  },
  {
    "review_text": "Horrible quality, fell apart quickly.",
    "rating": 1
  },
  {
    "review_text": "Good performance, satisfied overall.",
    "rating": 4
  },
  {
    "review_text": "Could be better, but it's acceptable.",
    "rating": 3
  },
  {
    "review_text": "Incredible product, highly recommended!",
    "rating": 5
  },
  {
    "review_text": "Avoid this product, not worth it.",
    "rating": 1
  },
  {
    "review_text": "Reasonable quality, met expectations.",
    "rating": 3
  },
  {
    "review_text": "Below average, disappointed.",
    "rating": 2
  },
  {
    "review_text": "Top-notch quality, very satisfied!",
    "rating": 5
  },
  {
    "review_text": "Worst purchase ever, regretting it.",
    "rating": 1
  },
  {
    "review_text": "Average performance, nothing exceptional.",
    "rating": 3
  },
  {
    "review_text": "Good value for the money spent.",
    "rating": 4
  },
  {
    "review_text": "Not up to par, needs improvement.",
    "rating": 2
  },
  {
    "review_text": "Absolutely outstanding, thrilled!",
    "rating": 5
  },
  {
    "review_text": "Total waste of money, avoid.",
    "rating": 1
  },
  {
    "review_text": "Met my expectations, decent product.",
    "rating": 4
  },
  {
    "review_text": "Could have been better, not satisfied.",
    "rating": 2
  },
  {
    "review_text": "Impressed beyond words, excellent!",
    "rating": 5
  },
  {
    "review_text": "Horrible quality, fell apart quickly.",
    "rating": 1
  },
  {
    "review_text": "Satisfactory performance, met my needs.",
    "rating": 4
  },
  {
    "review_text": "Disappointing purchase, regretting it.",
    "rating": 1
  },
  {
    "review_text": "Average quality, nothing special.",
    "rating": 3
  },
  {
    "review_text": "Top-notch product, highly satisfied!",
    "rating": 5
  }
]

# Create a DataFrame from the JSON data


In [11]:
df = pd.DataFrame(json_data)
df

Unnamed: 0,review_text,rating
0,"Great product, very satisfied with my purchase.",5
1,The quality is not up to my expectations.,2
2,"Great product, very satisfied with my purchase.",5
3,The quality is not up to my expectations.,2
4,Fast delivery and excellent customer service.,4
...,...,...
57,"Horrible quality, fell apart quickly.",1
58,"Satisfactory performance, met my needs.",4
59,"Disappointing purchase, regretting it.",1
60,"Average quality, nothing special.",3


In [12]:
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/danmarino/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [13]:
# Define a function to categorize sentiment
def get_sentiment_label(score):
    if score > 0.05:
        return "Positive"
    elif score < -0.05:
        return "Negative"
    else:
        return "Neutral"

df['sentiment'] = df['review_text'].apply(lambda x: get_sentiment_label(sia.polarity_scores(x)['compound']))

In [16]:
summary_report = pd.pivot_table(df, index='rating', columns='sentiment', values='review_text', aggfunc='count', fill_value=0)
summary_report.columns.name = None
summary_report.reset_index(inplace=True)

# Rename columns for clarity
summary_report.rename(columns={'Positive': 'Positive Reviews', 'Neutral': 'Neutral Reviews', 'Negative': 'Negative Reviews'}, inplace=True)

# Display the summary report
summary_report

Unnamed: 0,rating,Negative Reviews,Neutral Reviews,Positive Reviews
0,1,15,0,0
1,2,4,3,3
2,3,2,4,4
3,4,0,1,11
4,5,0,1,14


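### The sentiment of the reviews aligns well with the ratings: every 1-star review is classified as negative, 4- and 5-star reviews are almost all positive, and 2- and 3-star reviews are mixed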

## Objective:
You work for a retail company and have been provided with a JSON dataset containing sales data. Your task is to analyze the dataset and extract some insights regarding the sales performance. Specifically, you need to calculate the total sales revenue for each product category and identify the category with the highest total revenue.

## Data:
Below is a snippet of the JSON dataset representing sales data:

```json
[
    {"product": "Product A", "category": "Electronics", "price": 100, "quantity": 5},
    {"product": "Product B", "category": "Clothing", "price": 50, "quantity": 10},
    {"product": "Product C", "category": "Electronics", "price": 120, "quantity": 8},
    {"product": "Product D", "category": "Grocery", "price": 10, "quantity": 20},
    // ... more data
]
```
## Output 
Your function should return a dictionary containing the total sales revenue for each product category. For example:
```python
{
    "Electronics": 1460,  # Total revenue for Electronics category
    "Clothing": 500,      # Total revenue for Clothing category
    "Grocery": 200,       # Total revenue for Grocery category
    # ... more categories
}
```

In [2]:
import json

def calculate_category_revenue(data):
    revenue_by_category = {}
    for record in data:
        category = record["category"]
        price = record["price"]
        quantity = record["quantity"]
        total_revenue = price * quantity
        if category in revenue_by_category:
            revenue_by_category[category] += total_revenue
        else:
            revenue_by_category[category] = total_revenue
    return revenue_by_category

# Load the JSON dataset
data = [
    {"product": "Product A", "category": "Electronics", "price": 100, "quantity": 5},
    {"product": "Product B", "category": "Clothing", "price": 50, "quantity": 10},
    {"product": "Product C", "category": "Electronics", "price": 120, "quantity": 8},
    {"product": "Product D", "category": "Grocery", "price": 10, "quantity": 20},
]

# Call the function and print the result
revenue_result = calculate_category_revenue(data)
print(revenue_result)


{'Electronics': 1460, 'Clothing': 500, 'Grocery': 200}
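
The objective also asks for the category with the highest total revenue, which the function above does not return. One way to get it from the resulting dictionary is `max` with a key function; the sketch below reuses the printed result and the variable names `top_category` and `top_revenue` are illustrative.

```python
# Result printed above for the four sample records
revenue_by_category = {"Electronics": 1460, "Clothing": 500, "Grocery": 200}

# max iterates over the keys and compares them by their revenue value
top_category = max(revenue_by_category, key=revenue_by_category.get)
top_revenue = revenue_by_category[top_category]
```

If two categories tie, `max` returns the first one encountered, so a different tie-breaking rule would need to be stated explicitly.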


## Objective:

### Background:
As a Data Scientist working for a retail company, you are tasked with cleaning and preprocessing the sales data for further analysis. You want to create some additional features from the existing dataset to understand sales trends. 

### Question:
Write a function called `process_sales_data` that takes a pandas DataFrame containing sales information and returns a new DataFrame with the following transformations:

1. Replace missing values in 'Discount' column with the median discount.
2. Extract the numerical part of the 'Product_ID' (e.g. if 'Product_ID': 'P_123', extract 123) and create a new column 'Product_Number'.
3. Create a new column 'Total_Sales' that is the product of 'Quantity' and 'Price'.
4. Round 'Total_Sales' to two decimal places.
5. Filter the rows where 'Total_Sales' is greater than a given threshold.

### Libraries Needed:
- pandas 
- numpy
- re (for regex)

### Inputs:
- `sales_data` (pandas DataFrame): The DataFrame containing sales information with the following columns:
  - 'Order_ID' (str): The unique order ID.
  - 'Product_ID' (str): The product ID, which contains a letter prefix followed by a numerical ID (e.g., 'P_123').
  - 'Quantity' (int): The quantity of the product ordered.
  - 'Price' (float): The price of the product.
  - 'Discount' (float): The discount applied to the order. There may be missing values.
- `threshold` (float): The threshold for filtering rows based on 'Total_Sales'.

### Outputs:
- A new pandas DataFrame with the specified transformations, containing at least the following columns:
  - 'Order_ID'
  - 'Product_Number'
  - 'Total_Sales'


In [10]:
## Data:
# Example sales data in JSON format

sales_data_json = """
{
    "Order_ID": ["O_1001", "O_1002", "O_1003", "O_1004"],
    "Product_ID": ["P_123", "P_456", "P_789", "P_111"],
    "Quantity": [2, 4, 1, 3],
    "Price": [50.25, 30.75, 20.50, 25.00],
    "Discount": [5.00, null, 2.50, null]
}
"""
import pandas as pd
import json
import numpy as np

sales_data = json.loads(sales_data_json)
sales_df = pd.DataFrame(sales_data)
sales_df

Unnamed: 0,Order_ID,Product_ID,Quantity,Price,Discount
0,O_1001,P_123,2,50.25,5.0
1,O_1002,P_456,4,30.75,
2,O_1003,P_789,1,20.5,2.5
3,O_1004,P_111,3,25.0,


In [11]:
def clean_the_sales_data(sales_df, threshold):
    # Work on a copy so the caller's DataFrame is left unchanged
    sales_df = sales_df.copy()
    # Fill missing discounts with the median discount
    median_discount = sales_df['Discount'].median()
    sales_df['Discount'] = sales_df['Discount'].fillna(median_discount)
    # Get the numbers from the product and order IDs (drops the 2-character prefix, e.g. 'P_')
    sales_df['Product Number'] = sales_df['Product_ID'].str[2:].astype('int64')
    sales_df['Order Number'] = sales_df['Order_ID'].str[2:].astype('int64')
    # Create Total Sales, rounded to two decimal places
    sales_df['Total Sales'] = (sales_df['Quantity'] * sales_df['Price']).round(2)

    # Keep only rows whose Total Sales exceed the threshold
    sales_df = sales_df.loc[sales_df['Total Sales'] > threshold]
    return sales_df

clean_the_sales_data(sales_df, 25)


Unnamed: 0,Order_ID,Product_ID,Quantity,Price,Discount,Product Number,Order Number,Total Sales
0,O_1001,P_123,2,50.25,5.0,123,1001,100.5
1,O_1002,P_456,4,30.75,3.75,456,1002,123.0
3,O_1004,P_111,3,25.0,3.75,111,1004,75.0
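
The five steps in the question can also be sketched with the names the task statement uses (`process_sales_data`, `Product_Number`, `Total_Sales`). This is one possible version, not the only one: it assumes every `Product_ID` contains a run of digits and uses a regex extraction instead of slicing off a fixed-width prefix. The `example` frame simply repeats the sample data.

```python
import pandas as pd


def process_sales_data(sales_data: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Sketch of the five requested steps; returns a new DataFrame."""
    df = sales_data.copy()  # leave the caller's DataFrame untouched

    # 1. Fill missing discounts with the column median
    df["Discount"] = df["Discount"].fillna(df["Discount"].median())

    # 2. Pull the digits out of IDs such as 'P_123' (assumes digits are present)
    df["Product_Number"] = df["Product_ID"].str.extract(r"(\d+)", expand=False).astype(int)

    # 3 and 4. Quantity * Price, rounded to two decimal places
    df["Total_Sales"] = (df["Quantity"] * df["Price"]).round(2)

    # 5. Keep only rows strictly above the threshold
    return df[df["Total_Sales"] > threshold].reset_index(drop=True)


example = pd.DataFrame({
    "Order_ID": ["O_1001", "O_1002", "O_1003", "O_1004"],
    "Product_ID": ["P_123", "P_456", "P_789", "P_111"],
    "Quantity": [2, 4, 1, 3],
    "Price": [50.25, 30.75, 20.50, 25.00],
    "Discount": [5.00, None, 2.50, None],
})
result = process_sales_data(example, 25)
print(result[["Order_ID", "Product_Number", "Total_Sales"]])
```

Because the regex only looks for digits, this version does not depend on the prefix being exactly two characters long.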


## Objective:
### Background:
You are working as a Data Scientist for a company that specializes in analyzing customer reviews for various products. Your task is to write a Python function that extracts the top 'n' frequently used words from a given set of customer reviews and their corresponding frequencies. This will help the marketing team understand the customers' sentiment towards the products.

### Question:
Write a Python function called `top_frequent_words` that takes a list of customer reviews (strings) and an integer 'n', and returns a dictionary containing the top 'n' frequently used words and their corresponding frequencies. Ignore common stop words like 'and', 'the', 'is', etc., and consider only alphanumeric words.

### Libraries Needed:
- collections

### Inputs:
- reviews (List[str]): A list of customer reviews, where each review is represented as a string. The list can contain up to 10,000 reviews, and each review can have up to 500 characters.
- n (int): The number of top frequent words to be returned. It can range from 1 to 50.

### Expected Outputs:
- output (Dict[str, int]): A dictionary containing the top 'n' frequently used words and their corresponding frequencies.


In [2]:
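def top_frequent_words(reviews, word_count):
    # Count raw whitespace-separated tokens; case and punctuation are kept as-is
    words_and_counts = {}
    for review in reviews:
        for word in review.split():
            if word in words_and_counts:
                words_and_counts[word] += 1
            else:
                words_and_counts[word] = 1
    # Return the word_count most frequent (word, count) pairs, highest count first
    return sorted(words_and_counts.items(), key=lambda x: x[1], reverse=True)[:word_count]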


In [3]:
## Data:
reviews = [
    "I love this product! It's amazing and works like a charm.",
    "The quality is superb, and the design is sleek.",
    "Amazing product! I would recommend it to everyone.",
    "This is the best product I've ever bought.",
    "Love the design and the quality. Highly recommended!",
    "I'm disappointed with the product. It broke within a week.",
    "Not worth the price. Quality is subpar.",
    "Amazing quality for the price. I'm highly satisfied.",
    "The product is good but not great. It does the job.",
    "I'm in love with this product! Best purchase ever."
]

In [5]:
top_frequent_words(reviews, 20)

[('the', 8),
 ('is', 5),
 ('product!', 3),
 ('and', 3),
 ("I'm", 3),
 ('I', 2),
 ('love', 2),
 ('this', 2),
 ('a', 2),
 ('The', 2),
 ('quality', 2),
 ('design', 2),
 ('Amazing', 2),
 ('product', 2),
 ('with', 2),
 ('It', 2),
 ('price.', 2),
 ("It's", 1),
 ('amazing', 1),
 ('works', 1)]
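
The question also asks for stop-word removal, alphanumeric-only words, and a dictionary result. Below is a minimal sketch of one way to meet those requirements with `collections.Counter`. The function name `top_frequent_words_clean`, the short `STOP_WORDS` set, and the choice to drop tokens that still contain an apostrophe (such as "it's") are assumptions made for illustration.

```python
import string
from collections import Counter

# Small illustrative stop-word list (an assumption; extend as needed)
STOP_WORDS = {"a", "and", "but", "for", "i", "in", "is", "it", "the", "this", "to", "with"}


def top_frequent_words_clean(reviews, n):
    """Return the n most common non-stop-words as {word: count}."""
    counts = Counter()
    for review in reviews:
        for raw in review.lower().split():
            word = raw.strip(string.punctuation)  # 'product!' -> 'product'
            # Keep purely alphanumeric tokens that are not stop words
            if word.isalnum() and word not in STOP_WORDS:
                counts[word] += 1
    return dict(counts.most_common(n))


sample_reviews = [
    "The product is great!",
    "Great product, great price.",
    "It's the best.",
]
print(top_frequent_words_clean(sample_reviews, 2))
```

Lower-casing before counting merges variants such as "Amazing" and "amazing", which the raw token count keeps separate.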

## Objective:
### Background:
You are a data scientist working for a retail company. You have been provided with a dataset containing information about sales transactions. Your task is to write a Python function to calculate the total sales and average sales for each product category and for each quarter of the year.

### Question:
Write a Python function `quarterly_sales_analysis(sales_data: pd.DataFrame) -> Dict[str, Dict[str, Union[float, int]]]` that takes a pandas DataFrame as input. The DataFrame contains the following columns: 'Product_Category', 'Sales_Amount', 'Transaction_Date'. The function should return a dictionary with the total sales and average sales for each product category and for each quarter of the year.

### Inputs:
- `sales_data`: A pandas DataFrame with the following columns:
    - 'Product_Category': string, the category of the product
    - 'Sales_Amount': float, the sales amount for the transaction
    - 'Transaction_Date': string, the date of the transaction in the format 'YYYY-MM-DD'

### Outputs:
- A dictionary with keys representing the product categories and values being dictionaries containing:
    - 'Q1_Total': float, total sales for Q1
    - 'Q1_Average': float, average sales for Q1
    - 'Q2_Total': float, total sales for Q2
    - 'Q2_Average': float, average sales for Q2
    - 'Q3_Total': float, total sales for Q3
    - 'Q3_Average': float, average sales for Q3
    - 'Q4_Total': float, total sales for Q4
    - 'Q4_Average': float, average sales for Q4

### Libraries Needed:
- pandas
- numpy


In [5]:
import pandas as pd
import numpy as np
import datetime as dt

In [6]:
## Data:
data = pd.DataFrame({
    "Product_Category": ["Electronics", "Clothing", "Electronics", "Clothing", "Electronics"],
    "Sales_Amount": [100.50, 50.25, 200.75, 40.00, 300.00],
    "Transaction_Date": ["2022-01-15", "2022-01-20", "2022-04-10", "2022-07-05", "2022-10-30"]
}
)

In [7]:
data

Unnamed: 0,Product_Category,Sales_Amount,Transaction_Date
0,Electronics,100.5,2022-01-15
1,Clothing,50.25,2022-01-20
2,Electronics,200.75,2022-04-10
3,Clothing,40.0,2022-07-05
4,Electronics,300.0,2022-10-30


In [16]:
def quarterly_sales_analysis(sales_data):
    # Work on a copy so the caller's DataFrame does not gain a 'Quarter' column
    sales_data = sales_data.copy()
    sales_data['Quarter'] = pd.to_datetime(sales_data['Transaction_Date']).dt.quarter
    # Total and average sales per category and quarter in a single aggregation
    quarterly_sales_data = (
        sales_data.groupby(['Product_Category', 'Quarter'])['Sales_Amount']
        .agg(sales_total='sum', sales_avg='mean')
        .reset_index()
    )
    return quarterly_sales_data

In [17]:
quarterly_sales_analysis(data)

Unnamed: 0,Product_Category,Quarter,sales_total,sales_avg
0,Clothing,1,50.25,50.25
1,Clothing,3,40.0,40.0
2,Electronics,1,100.5,100.5
3,Electronics,2,200.75,200.75
4,Electronics,4,300.0,300.0
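
The question specifies a nested dictionary keyed by category rather than a DataFrame. The sketch below shows one way to build that shape. The name `quarterly_sales_dict` is illustrative, and reporting `0.0` for quarters with no transactions is an assumption, since the task does not say how empty quarters should be represented.

```python
import pandas as pd


def quarterly_sales_dict(sales_data: pd.DataFrame) -> dict:
    """Build {category: {'Q1_Total': ..., 'Q1_Average': ..., ...}}."""
    df = sales_data.copy()
    df["Quarter"] = pd.to_datetime(df["Transaction_Date"]).dt.quarter

    # One row per (category, quarter) with 'sum' and 'mean' columns
    stats = df.groupby(["Product_Category", "Quarter"])["Sales_Amount"].agg(["sum", "mean"])

    result = {}
    for category in df["Product_Category"].unique():
        entry = {}
        for quarter in (1, 2, 3, 4):
            if (category, quarter) in stats.index:
                total = float(stats.loc[(category, quarter), "sum"])
                average = float(stats.loc[(category, quarter), "mean"])
            else:
                # Assumption: quarters without transactions report 0.0
                total, average = 0.0, 0.0
            entry[f"Q{quarter}_Total"] = total
            entry[f"Q{quarter}_Average"] = average
        result[category] = entry
    return result


sample_sales = pd.DataFrame({
    "Product_Category": ["Electronics", "Clothing", "Electronics", "Clothing", "Electronics"],
    "Sales_Amount": [100.50, 50.25, 200.75, 40.00, 300.00],
    "Transaction_Date": ["2022-01-15", "2022-01-20", "2022-04-10", "2022-07-05", "2022-10-30"],
})
print(quarterly_sales_dict(sample_sales)["Electronics"])
```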


## Objective:
### Background:
You are a data scientist working for an e-commerce company. The company wants to understand the sales trends for various products across different regions. The data consists of sales records, and your task is to write a Python function that calculates the average sales for each product category in each region, along with the percentage change compared to the previous month.

### Question:
Write a Python function `calculate_sales_trends(data: pd.DataFrame) -> pd.DataFrame` that takes in a Pandas DataFrame containing sales data and returns a new DataFrame containing the average sales for each product category in each region, along with the percentage change compared to the previous month.

### Inputs:
- A Pandas DataFrame `data` with the following columns:
  - `date`: str, the date of the sale in the format "YYYY-MM-DD".
  - `region`: str, the region where the sale occurred.
  - `product_category`: str, the category of the product sold.
  - `sales`: float, the sales amount in dollars.

### Outputs:
- A Pandas DataFrame with the following columns:
  - `region`: str, the region.
  - `product_category`: str, the product category.
  - `average_sales`: float, the average sales for that product category in the region.
  - `percentage_change`: float, the percentage change in average sales compared to the previous month.

### Libraries Needed:
- pandas

## Data:
Here you'll find the sales data in JSON format:

```json
[
  {"date": "2023-01-01", "region": "North", "product_category": "Electronics", "sales": 200.50},
  {"date": "2023-01-02", "region": "South", "product_category": "Furniture", "sales": 150.75},
  {"date": "2023-01-03", "region": "North", "product_category": "Electronics", "sales": 220.30},
  {"date": "2023-02-01", "region": "North", "product_category": "Electronics", "sales": 210.40},
  {"date": "2023-02-02", "region": "South", "product_category": "Furniture", "sales": 160.25},
  {"date": "2023-02-03", "region": "North", "product_category": "Electronics", "sales": 230.50},
  {"date": "2023-03-01", "region": "North", "product_category": "Electronics", "sales": 205.60},
  {"date": "2023-03-02", "region": "South", "product_category": "Furniture", "sales": 170.80},
  {"date": "2023-03-03", "region": "North", "product_category": "Electronics", "sales": 240.00}
]
```


In [2]:
import pandas as pd

In [3]:
data = pd.DataFrame([
  {"date": "2023-01-01", "region": "North", "product_category": "Electronics", "sales": 200.50},
  {"date": "2023-01-02", "region": "South", "product_category": "Furniture", "sales": 150.75},
  {"date": "2023-01-03", "region": "North", "product_category": "Electronics", "sales": 220.30},
  {"date": "2023-02-01", "region": "North", "product_category": "Electronics", "sales": 210.40},
  {"date": "2023-02-02", "region": "South", "product_category": "Furniture", "sales": 160.25},
  {"date": "2023-02-03", "region": "North", "product_category": "Electronics", "sales": 230.50},
  {"date": "2023-03-01", "region": "North", "product_category": "Electronics", "sales": 205.60},
  {"date": "2023-03-02", "region": "South", "product_category": "Furniture", "sales": 170.80},
  {"date": "2023-03-03", "region": "North", "product_category": "Electronics", "sales": 240.00},
])

In [4]:
data

Unnamed: 0,date,region,product_category,sales
0,2023-01-01,North,Electronics,200.5
1,2023-01-02,South,Furniture,150.75
2,2023-01-03,North,Electronics,220.3
3,2023-02-01,North,Electronics,210.4
4,2023-02-02,South,Furniture,160.25
5,2023-02-03,North,Electronics,230.5
6,2023-03-01,North,Electronics,205.6
7,2023-03-02,South,Furniture,170.8
8,2023-03-03,North,Electronics,240.0


In [27]:
def calculate_sales_trends(data):
    trends = data.groupby(['region', 'product_category'],as_index=False).agg({
        'sales': ['sum', 'mean'],
        'date': ['min']
    })


    # Renaming columns for clarity
    trends.columns = ['region', 'product_category', 'total_sales', 'average_sales', 'min_date']

    return trends


In [28]:
trends = calculate_sales_trends(data)
trends

Unnamed: 0,region,product_category,total_sales,average_sales,min_date
0,North,Electronics,1307.3,217.883333,2023-01-01
1,South,Furniture,481.8,160.6,2023-01-02
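
The question also asks for the percentage change compared to the previous month, which requires a monthly breakdown. The following is a hedged sketch of one approach: average sales per region, category, and month, followed by a within-group `pct_change`. The function name `monthly_sales_trends` and the helper column `month` are illustrative. It also assumes each group has a row for every month, because `pct_change` compares against the previous row rather than the previous calendar month.

```python
import pandas as pd


def monthly_sales_trends(data: pd.DataFrame) -> pd.DataFrame:
    df = data.copy()
    # Month bucket as a Period (e.g. 2023-01); 'month' is an added helper column
    df["month"] = pd.to_datetime(df["date"]).dt.to_period("M")

    monthly = (
        df.groupby(["region", "product_category", "month"])["sales"]
        .mean()
        .reset_index(name="average_sales")
        .sort_values(["region", "product_category", "month"])
        .reset_index(drop=True)
    )
    # Percentage change versus the previous row within each region/category pair
    monthly["percentage_change"] = (
        monthly.groupby(["region", "product_category"])["average_sales"].pct_change() * 100
    )
    return monthly


sample = pd.DataFrame([
    {"date": "2023-01-01", "region": "North", "product_category": "Electronics", "sales": 200.50},
    {"date": "2023-01-02", "region": "South", "product_category": "Furniture", "sales": 150.75},
    {"date": "2023-01-03", "region": "North", "product_category": "Electronics", "sales": 220.30},
    {"date": "2023-02-01", "region": "North", "product_category": "Electronics", "sales": 210.40},
    {"date": "2023-02-02", "region": "South", "product_category": "Furniture", "sales": 160.25},
    {"date": "2023-02-03", "region": "North", "product_category": "Electronics", "sales": 230.50},
])
trends_by_month = monthly_sales_trends(sample)
print(trends_by_month)
```

The first month of each group has no predecessor, so its `percentage_change` is `NaN`.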


### `.agg()` accepts a dictionary: each key is a column to aggregate, and each value is the aggregation function (or list of functions) to apply.
- This is very helpful to know: one `groupby` call can produce several summary columns, because multiple columns and multiple functions can be combined in the same dictionary.
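
The manual `trends.columns = [...]` rename can be avoided with pandas' named aggregation, where each keyword argument is `new_column=(source_column, function)`. A small sketch on made-up rows (the `sales` frame below is illustrative data, not the dataset above):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South"],
    "product_category": ["Electronics", "Electronics", "Furniture"],
    "sales": [200.0, 100.0, 150.0],
    "date": ["2023-01-01", "2023-01-03", "2023-01-02"],
})

# Named aggregation: the result already has flat, descriptive column names
summary = (
    sales.groupby(["region", "product_category"])
    .agg(
        total_sales=("sales", "sum"),
        average_sales=("sales", "mean"),
        min_date=("date", "min"),
    )
    .reset_index()
)
print(summary)
```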

## Objective:
### Background:
As a data scientist working for a retail company, you are given a dataset containing information about customer purchases. You need to derive insights and perform exploratory data analysis to identify the top 5 products in terms of sales and evaluate the correlation between customer age and their spending on these products.

### Question:
Write a Python function `top_products_analysis(data: pd.DataFrame) -> Tuple[List[str], float]` that takes a Pandas DataFrame containing customer purchase information and returns a tuple. The first element of the tuple is a list of the top 5 products by sales, and the second element is the Pearson correlation coefficient between the customer's age and their spending on these top 5 products.

### Inputs:
- `data`: A Pandas DataFrame with columns:
  - `product_name` (str): The name of the product.
  - `sales` (float): The sales amount for that product.
  - `customer_age` (int): The age of the customer who made the purchase.
  - `quantity` (int): The quantity of the product sold.

### Outputs:
- A tuple containing:
  - A list of strings representing the names of the top 5 products by sales.
  - A float representing the Pearson correlation coefficient between customer age and their spending on these top 5 products. The result should be rounded to 4 decimal places.

### Libraries Needed:
- pandas
- numpy


In [1]:
import pandas as pd
import numpy as np

data = pd.DataFrame(
{
  "product_name": ["Laptop", "Mobile", "TV", "Headphones", "Tablet", "Chair", "Table", "Sofa", "Refrigerator", "Washing Machine"],
  "sales": [1200.50, 800.75, 600.30, 150.20, 300.40, 100.10, 200.30, 400.50, 700.60, 500.40],
  "customer_age": [25, 30, 35, 20, 40, 22, 32, 38, 26, 27],
  "quantity": [10, 20, 15, 5, 8, 3, 4, 6, 7, 9]
})

In [2]:
data

Unnamed: 0,product_name,sales,customer_age,quantity
0,Laptop,1200.5,25,10
1,Mobile,800.75,30,20
2,TV,600.3,35,15
3,Headphones,150.2,20,5
4,Tablet,300.4,40,8
5,Chair,100.1,22,3
6,Table,200.3,32,4
7,Sofa,400.5,38,6
8,Refrigerator,700.6,26,7
9,Washing Machine,500.4,27,9


In [9]:
def top_products_analysis(dataframe):
    # Total sales per product, largest first; keep the top 5 names
    grouped_data = dataframe.groupby('product_name')['sales'].sum().sort_values(ascending=False).head(5)
    top_5_products = grouped_data.index.tolist()

    # Pearson correlation matrix of sales and customer age, computed over all rows
    correlation = dataframe[['sales', 'customer_age']].corr(method='pearson')

    return (top_5_products, correlation)

In [10]:
top_products_analysis(data)

(['Laptop', 'Mobile', 'Refrigerator', 'TV', 'Washing Machine'],
                  sales  customer_age
 sales         1.000000     -0.025281
 customer_age -0.025281      1.000000)
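
The task asks for a single float: the Pearson coefficient restricted to the top 5 products and rounded to 4 decimal places. The sketch below is one way to produce that. It assumes "spending" means the `sales` column; the name `top_products_with_correlation` and the `k` parameter are illustrative.

```python
import pandas as pd


def top_products_with_correlation(data: pd.DataFrame, k: int = 5):
    """Return (top-k product names by total sales, Pearson r of age vs. sales)."""
    totals = data.groupby("product_name")["sales"].sum()
    top_names = totals.nlargest(k).index.tolist()

    # Restrict to purchases of the top-k products before correlating
    subset = data[data["product_name"].isin(top_names)]
    r = subset["customer_age"].corr(subset["sales"], method="pearson")
    return top_names, round(float(r), 4)


purchases = pd.DataFrame({
    "product_name": ["Laptop", "Mobile", "TV", "Headphones", "Tablet", "Chair", "Table", "Sofa", "Refrigerator", "Washing Machine"],
    "sales": [1200.50, 800.75, 600.30, 150.20, 300.40, 100.10, 200.30, 400.50, 700.60, 500.40],
    "customer_age": [25, 30, 35, 20, 40, 22, 32, 38, 26, 27],
    "quantity": [10, 20, 15, 5, 8, 3, 4, 6, 7, 9],
})
names, r_value = top_products_with_correlation(purchases)
print(names, r_value)
```

`Series.corr` returns a scalar, so no indexing into a correlation matrix is needed.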

## Objective:
### Background:
As a data scientist at a retail company, you are given a dataset containing sales data for various products. Your task is to analyze the data to find the average sales for each product category and identify the top 3 products with the highest sales in each category.

### Question:
Write a function named `analyze_sales` that takes in a pandas DataFrame containing sales data. The DataFrame will have the following columns: `product_id`, `category`, `product_name`, and `sales`. Your function should return a dictionary containing two items:
1. 'average_sales': A dictionary containing the average sales for each product category.
2. 'top_products': A dictionary containing the top 3 products with the highest sales for each category.

### Inputs:
- A pandas DataFrame named `sales_data` with the following columns:
    - `product_id` (int): The unique identifier for the product.
    - `category` (string): The category of the product.
    - `product_name` (string): The name of the product.
    - `sales` (float): The sales amount for the product.

### Expected Outputs:
- A dictionary with two keys:
    - 'average_sales': A dictionary containing the average sales for each product category.
    - 'top_products': A dictionary containing the top 3 products with the highest sales for each category. Each entry should contain a list of product names.

### Libraries Needed:
- pandas


In [1]:
import pandas as pd

data = pd.DataFrame({
  "product_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
  "category": ["Electronics", "Electronics", "Clothing", "Clothing", "Clothing", "Food", "Food", "Food", "Books", "Books"],
  "product_name": ["Laptop", "Smartphone", "Shirt", "Pants", "Jacket", "Bread", "Milk", "Cheese", "Novel", "Comics"],
  "sales": [1200.50, 800.30, 300.40, 150.20, 200.10, 100.60, 50.70, 40.30, 70.80, 50.90]
})
data

Unnamed: 0,product_id,category,product_name,sales
0,1,Electronics,Laptop,1200.5
1,2,Electronics,Smartphone,800.3
2,3,Clothing,Shirt,300.4
3,4,Clothing,Pants,150.2
4,5,Clothing,Jacket,200.1
5,6,Food,Bread,100.6
6,7,Food,Milk,50.7
7,8,Food,Cheese,40.3
8,9,Books,Novel,70.8
9,10,Books,Comics,50.9


In [3]:
def analyze_sales(dataframe):
    # Average sales per category
    avg_sales = dataframe.groupby('category').agg({'sales': 'mean'})
    # Top 3 categories ranked by average sales
    top_products = avg_sales.sort_values('sales', ascending=False).head(3)
    top_products_dict = top_products.to_dict()
    return avg_sales, top_products_dict

avg_sales, top_products_dict = analyze_sales(data)

In [4]:
avg_sales

Unnamed: 0_level_0,sales
category,Unnamed: 1_level_1
Books,60.85
Clothing,216.9
Electronics,1000.4
Food,63.866667


In [5]:
top_products_dict

{'sales': {'Electronics': 1000.4,
  'Clothing': 216.89999999999998,
  'Food': 63.86666666666667}}
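
The expected output in the question is a dictionary with an `'average_sales'` mapping and a `'top_products'` mapping of category to product names. The sketch below shows one way to build it; the function name `analyze_sales_by_category` and the `top_n` parameter are illustrative. Categories with fewer than three products simply return the products they have.

```python
import pandas as pd


def analyze_sales_by_category(sales_data: pd.DataFrame, top_n: int = 3) -> dict:
    # Mean sales per category as a plain {category: value} dict
    average_sales = sales_data.groupby("category")["sales"].mean().to_dict()

    # Sort once by sales, then take the first top_n names within each category
    ranked = sales_data.sort_values("sales", ascending=False)
    top_products = (
        ranked.groupby("category")["product_name"]
        .apply(lambda names: names.head(top_n).tolist())
        .to_dict()
    )
    return {"average_sales": average_sales, "top_products": top_products}


products = pd.DataFrame({
    "product_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "category": ["Electronics", "Electronics", "Clothing", "Clothing", "Clothing", "Food", "Food", "Food", "Books", "Books"],
    "product_name": ["Laptop", "Smartphone", "Shirt", "Pants", "Jacket", "Bread", "Milk", "Cheese", "Novel", "Comics"],
    "sales": [1200.50, 800.30, 300.40, 150.20, 200.10, 100.60, 50.70, 40.30, 70.80, 50.90],
})
analysis = analyze_sales_by_category(products)
print(analysis["top_products"])
```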

### Objective:
You are working as a data scientist at a technology company and are tasked with analyzing user engagement data. Your goal is to write a Python function that will identify the top 5 engaged users for a given week and calculate the mean, median, and standard deviation of their engagement scores.

### Question:
Write a function `top_users_statistics(data: pd.DataFrame) -> Tuple[float, float, float]` that takes a DataFrame containing user engagement data and returns a tuple containing the mean, median, and standard deviation of the engagement scores of the top 5 users for the week.

### Libraries Needed:
- pandas
- numpy

### Inputs:
- `data`: A pandas DataFrame containing the user engagement data. The DataFrame has the following columns:
  - `user_id`: An integer representing the unique user ID.
  - `engagement_score`: A float representing the engagement score of the user for the week.

### Expected Outputs:
- A tuple containing three float values representing the mean, median, and standard deviation of the engagement scores of the top 5 users for the week.


In [2]:
import pandas as pd

In [5]:
### Data:
data = pd.DataFrame({
  "user_id": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
  "engagement_score": [52.1, 45.5, 67.8, 23.4, 89.2, 34.5, 76.3, 54.8, 32.6, 18.9]
})


def top_users_statistics(data):
    # Keep only the five highest engagement scores
    top_scores = data['engagement_score'].nlargest(5)
    engagement_mean = round(top_scores.mean(), 2)
    engagement_median = round(top_scores.median(), 2)
    engagement_std = round(top_scores.std(), 4)  # sample standard deviation (ddof=1)
    return (engagement_mean, engagement_median, engagement_std)
results = top_users_statistics(data)
results


(68.04, 67.8, 15.3738)

### Background:
You are a data scientist at a retail company and are tasked with analyzing the sales data of different products across various stores. Your goal is to identify the top-selling products in each store and provide a summary statistic.

### Question:
Write a Python function named `top_selling_products` that takes a pandas DataFrame as an input, containing sales data, and returns a new DataFrame. This new DataFrame should contain the top 3 selling products for each store based on the 'quantity_sold' column.

### Inputs:
- `sales_data`: A pandas DataFrame containing the sales data. The DataFrame will have the following columns:
  - 'store_id': an integer representing the store's ID.
  - 'product_id': an integer representing the product's ID.
  - 'product_name': a string representing the product's name.
  - 'quantity_sold': an integer representing the quantity sold.

### Expected Outputs:
- A pandas DataFrame containing the top 3 selling products for each store. The DataFrame should have the following columns:
  - 'store_id': an integer representing the store's ID.
  - 'product_name': a string representing the top 3 selling products' names.
  - 'quantity_sold': an integer representing the quantity sold for the top 3 selling products.

### Libraries Needed:
- pandas


In [1]:
import pandas as pd

data = pd.DataFrame({
  "store_id": [1, 1, 1, 1, 2, 2, 2, 2, 2],
  "product_id": [101, 102, 103, 104, 201, 202, 203, 204, 205],
  "product_name": ["Apple", "Banana", "Cherry", "Date", "Eggplant", "Fig", "Grape", "Honeydew", "Indigo Berry"],
  "quantity_sold": [50, 30, 40, 10, 20, 60, 55, 15, 5]
}
)

In [5]:
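def top_three(data):
    # Sort all rows by quantity sold, highest first (no per-store grouping)
    return data.sort_values('quantity_sold', ascending=False)


In [6]:
# Use a separate name for the result so the function is not overwritten
top_three_result = top_three(data)
top_three_result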

Unnamed: 0,store_id,product_id,product_name,quantity_sold
5,2,202,Fig,60
6,2,203,Grape,55
0,1,101,Apple,50
2,1,103,Cherry,40
1,1,102,Banana,30
4,2,201,Eggplant,20
7,2,204,Honeydew,15
3,1,104,Date,10
8,2,205,Indigo Berry,5
