# Task Overview: Sales Forecasting Process
#### You are working with a dataset of product sales. Your task is to clean and prepare the data, build a model to predict sales, and evaluate the model's performance.

### Task 1: Data Loading and Cleaning (Pandas)
- Load the dataset: Create a DataFrame from the dictionary in the code cell below, which represents the sales data.
- Clean the data: 
    - Impute missing values: 
        - Use SimpleImputer from sklearn to fill missing values:
            - Replace missing values in "Units_Sold" and "Price_Per_Unit" with the mean of the respective column.
            - Ensure all missing values are handled, and print the cleaned DataFrame.



In [4]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


data = {
    'Product_ID': [101, 102, 103, 104, 105],
    'Units_Sold': [15, np.nan, 20, 17, 30],
    'Price_Per_Unit': [100, 120, 150, np.nan, 130],
    'Discount_Applied': [0, 10, 5, 0, 15],
    'Store_Location': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Chicago']
}

## Loading Dataset into a DataFrame.
df = pd.DataFrame(data)

## Replacing missing values.

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit once on both columns; each column is filled with its own mean.
cols = ['Units_Sold', 'Price_Per_Unit']
df[cols] = imputer.fit_transform(df[cols])

## Printing clean DataFrame.

print(df)


   Product_ID  Units_Sold  Price_Per_Unit  Discount_Applied Store_Location
0         101        15.0           100.0                 0       New York
1         102        20.5           120.0                10    Los Angeles
2         103        20.0           150.0                 5       New York
3         104        17.0           125.0                 0        Chicago
4         105        30.0           130.0                15        Chicago
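The task also asks to ensure that all missing values are handled. A minimal sketch of one way to check this, assuming the same two numeric columns as above (the names `df_check`, `missing_before`, `missing_after`, and `fill_values` are illustrative and not part of the original notebook): count the NaN values before and after imputation, and read the fitted means from the imputer's `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same numeric columns as the dataset above (illustrative copy).
df_check = pd.DataFrame({
    'Units_Sold': [15, np.nan, 20, 17, 30],
    'Price_Per_Unit': [100, 120, 150, np.nan, 130],
})

cols = ['Units_Sold', 'Price_Per_Unit']

# Number of missing values per column before imputation.
missing_before = df_check[cols].isna().sum()

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df_check[cols] = imputer.fit_transform(df_check[cols])

# Number of missing values per column after imputation.
missing_after = df_check[cols].isna().sum()

# The means the imputer learned, one per column, in column order.
fill_values = dict(zip(cols, imputer.statistics_))

print(missing_before.to_dict())
print(missing_after.to_dict())
print(fill_values)
```

The fill values should match the imputed cells in the printed DataFrame: 20.5 for Units_Sold and 125.0 for Price_Per_Unit.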


### Task 2: Feature Engineering (Pandas and NumPy)
#### Create new features:
- Add a new column Revenue, which is calculated as: Revenue = (Units_Sold * Price_Per_Unit) - Discount_Applied
- Categorize locations:
    - Create a new column called Store_Category based on Store_Location:
        - If the store is in New York, label it "Tier 1", otherwise label it "Tier 2".
        - Print the updated DataFrame.


In [6]:
## Adding Revenue Column.
df['Revenue'] = df['Units_Sold']*df['Price_Per_Unit']-df['Discount_Applied']

## Adding Store_Category Column.
df['Store_Category'] = np.where(df['Store_Location'] == 'New York', 'Tier 1', 'Tier 2')

## Printing Updated df.
print(df)

   Product_ID  Units_Sold  Price_Per_Unit  Discount_Applied Store_Location  \
0         101        15.0           100.0                 0       New York   
1         102        20.5           120.0                10    Los Angeles   
2         103        20.0           150.0                 5       New York   
3         104        17.0           125.0                 0        Chicago   
4         105        30.0           130.0                15        Chicago   

   Revenue Store_Category  
0   1500.0         Tier 1  
1   2450.0         Tier 2  
2   2995.0         Tier 1  
3   2125.0         Tier 2  
4   3885.0         Tier 2  
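As a quick sanity check on the two new columns, the sketch below rebuilds them from the cleaned values printed above and compares one row against a hand calculation. The names `df_feat`, `row1_by_hand`, and `tier_counts` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Cleaned values copied from the printed DataFrame above.
df_feat = pd.DataFrame({
    'Units_Sold': [15.0, 20.5, 20.0, 17.0, 30.0],
    'Price_Per_Unit': [100.0, 120.0, 150.0, 125.0, 130.0],
    'Discount_Applied': [0, 10, 5, 0, 15],
    'Store_Location': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Chicago'],
})

# Revenue = (Units_Sold * Price_Per_Unit) - Discount_Applied
df_feat['Revenue'] = (
    df_feat['Units_Sold'] * df_feat['Price_Per_Unit'] - df_feat['Discount_Applied']
)

# New York stores are "Tier 1", every other location is "Tier 2".
df_feat['Store_Category'] = np.where(
    df_feat['Store_Location'] == 'New York', 'Tier 1', 'Tier 2'
)

# Row 1 by hand: 20.5 units * 120 per unit - 10 discount.
row1_by_hand = 20.5 * 120 - 10

# How many stores fall into each tier.
tier_counts = df_feat['Store_Category'].value_counts().to_dict()

print(row1_by_hand, df_feat.loc[1, 'Revenue'])
print(tier_counts)
```

Note that the formula subtracts Discount_Applied as an absolute amount, not as a percentage of the price.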


### Task 3: Modeling (scikit-learn)
#### Prepare data for modeling:
- Use the columns "Units_Sold", "Price_Per_Unit", and "Discount_Applied" as the features (X).
- Use the Revenue column as the target variable (y).
- Split the data into training and test sets (80/20 split).
- Create and fit a LinearRegression model using the training data.
- Evaluate the model:
    - Print the model's R² score on the test set.
    - Predict the revenue for a sample input where "Units_Sold" = 25, "Price_Per_Unit" = 120, and "Discount_Applied" = 10.
    
    
##### (Note: The logic is correct, but with only five rows a train/test split is unreliable: the 20% test set contains a single row. Continue for the sake of practice, but keep in mind that for small datasets, cross-validation or training on the whole sample is a better choice.)

In [9]:
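# Selecting the features (X) and the target (y).
X = df[['Units_Sold', 'Price_Per_Unit', 'Discount_Applied']]
y = df['Revenue']

# Splitting into train and test sets (80/20).
# No random_state is set, so the split and the results below can vary between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Creating and fitting the LinearRegression model.
lr = LinearRegression()
lr.fit(X_train, y_train)

# Evaluating the model. With 5 rows the test set holds a single row,
# and R² is undefined for fewer than two samples, so score() returns nan.
r2 = lr.score(X_test, y_test)
print(f'R2 of LR model on the test set = {r2:.4f}')

# Passing a DataFrame with the training column names avoids scikit-learn's feature-name warning.
sample = pd.DataFrame([[25, 120, 10]], columns=X.columns)
pred = lr.predict(sample)
print(f'Predicting Revenue where Units_Sold = 25, Price_Per_Unit = 120, and Discount_Applied = 10: {pred}')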

R2 of LR model on the test set = nan
Predicting Revenue where Units_Sold = 25, Price_Per_Unit = 120, and Discount_Applied = 10: [3090.]
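The `nan` above appears because the 20% test set is a single row, and R² is not defined for fewer than two samples. The note before the cell suggests cross-validation for small datasets; the following is a minimal sketch of one such approach, leave-one-out cross-validation, and not the notebook's original method. The names `df_cv`, `y_pred`, `loo_r2`, and `loo_mae` are illustrative, and the data is copied from the cleaned DataFrame printed earlier.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Cleaned data copied from the printed DataFrame above.
df_cv = pd.DataFrame({
    'Units_Sold': [15.0, 20.5, 20.0, 17.0, 30.0],
    'Price_Per_Unit': [100.0, 120.0, 150.0, 125.0, 130.0],
    'Discount_Applied': [0, 10, 5, 0, 15],
    'Revenue': [1500.0, 2450.0, 2995.0, 2125.0, 3885.0],
})

X = df_cv[['Units_Sold', 'Price_Per_Unit', 'Discount_Applied']]
y = df_cv['Revenue']

# Each of the 5 rows is held out once; the model is refit on the other 4 rows.
y_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())

# The metrics are computed over all 5 held-out predictions together,
# so R² is defined even though each fold tests a single row.
loo_r2 = r2_score(y, y_pred)
loo_mae = mean_absolute_error(y, y_pred)

print(f'Leave-one-out R2 = {loo_r2:.4f}')
print(f'Leave-one-out MAE = {loo_mae:.2f}')
```

With only four training rows per fold and three features, the held-out predictions can be far off, so these scores should be read as a rough indication at best.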




### Task 4: Data Summary (Pandas)
- Summary statistics:
    - Get and print the mean, median, and standard deviation of the Revenue column using Pandas methods.

In [11]:
# describe() covers all three statistics: mean, std, and the median (the 50% row).
print(df['Revenue'].describe())

count       5.000000
mean     2591.000000
std       903.461399
min      1500.000000
25%      2125.000000
50%      2450.000000
75%      2995.000000
max      3885.000000
Name: Revenue, dtype: float64
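Since the task asks specifically for the mean, median, and standard deviation, a narrower alternative to `describe()` is sketched below. It assumes the Revenue values printed above; the names `revenue` and `summary` are illustrative.

```python
import pandas as pd

# Revenue values copied from the DataFrame printed earlier.
revenue = pd.Series([1500.0, 2450.0, 2995.0, 2125.0, 3885.0], name='Revenue')

# agg() with a list of method names returns one value per statistic.
# pandas computes the sample standard deviation (ddof=1) by default.
summary = revenue.agg(['mean', 'median', 'std'])

print(summary)
```

The median here is the same number as the 50% row in the `describe()` output.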
