In [1]:
from google.colab import files
uploaded = files.upload()

Saving Advertising.csv to Advertising.csv


## Save and download the model

### Subtask:
Save the trained model and download it for offline use.

**Reasoning**:
Use `joblib` to save the trained model to a file and then use `files.download` to download the saved model file.

In [8]:
import joblib
from google.colab import files

# Save the model
filename = 'linear_regression_model.pkl'
joblib.dump(model, filename)

# Download the model
files.download(filename)

print(f"Model saved as {filename} and downloaded.")

Model saved as linear_regression_model.pkl and downloaded.
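
To use the saved file offline, the model has to be loaded back with `joblib.load`. The following is a minimal sketch of that round trip. It assumes a small synthetic dataset and the hypothetical names `demo_model` and `loaded_model` in place of the notebook's trained `model`, and it writes to a temporary directory rather than the Colab working directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption for this sketch, not the Advertising dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = X_demo @ np.array([0.05, 0.2, 0.0]) + 3.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# Write the model to disk and read it back.
path = os.path.join(tempfile.mkdtemp(), "linear_regression_model.pkl")
joblib.dump(demo_model, path)
loaded_model = joblib.load(path)

# The reloaded model should reproduce the original predictions.
same_predictions = np.allclose(
    demo_model.predict(X_demo), loaded_model.predict(X_demo)
)
print("Predictions match after reload:", same_predictions)
```

Pickle-based files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.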


# Task
Load the dataset "Advertising.csv", train a machine learning model to predict sales based on advertising expenditures, and evaluate the model's performance.

## Load the data

### Subtask:
Load the uploaded data into a pandas DataFrame.


**Reasoning**:
The first step is to load the data into a pandas DataFrame for further processing. This involves importing the pandas library and reading the CSV content from the uploaded dictionary.



In [2]:
import pandas as pd
import io

df = pd.read_csv(io.BytesIO(uploaded['Advertising.csv']))
display(df.head())

,Unnamed: 0,TV,Radio,Newspaper,Sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


## Explore and preprocess the data

### Subtask:
Analyze the data to understand its structure, check for missing values, and perform any necessary preprocessing steps like feature scaling or encoding.


**Reasoning**:
Analyze the data by displaying column information, checking for missing values, calculating descriptive statistics, and dropping the unnecessary index column.



In [3]:
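# Column names, dtypes, and non-null counts
df.info()

# Count missing values in each column
print("\nMissing values per column:")
print(df.isnull().sum())

# Summary statistics for the numeric columns
print("\nDescriptive statistics:")
display(df.describe())

# 'Unnamed: 0' is the row index written into the CSV, not a feature
if 'Unnamed: 0' in df.columns:
    df = df.drop('Unnamed: 0', axis=1)
    print("\nDropped 'Unnamed: 0' column.")
display(df.head())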

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  200 non-null    int64  
 1   TV          200 non-null    float64
 2   Radio       200 non-null    float64
 3   Newspaper   200 non-null    float64
 4   Sales       200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB

Missing values per column:
Unnamed: 0    0
TV            0
Radio         0
Newspaper     0
Sales         0
dtype: int64

Descriptive statistics:


,Unnamed: 0,TV,Radio,Newspaper,Sales
count,200.0,200.0,200.0,200.0,200.0
mean,100.5,147.0425,23.264,30.554,14.0225
std,57.879185,85.854236,14.846809,21.778621,5.217457
min,1.0,0.7,0.0,0.3,1.6
25%,50.75,74.375,9.975,12.75,10.375
50%,100.5,149.75,22.9,25.75,12.9
75%,150.25,218.825,36.525,45.1,17.4
max,200.0,296.4,49.6,114.0,27.0



Dropped 'Unnamed: 0' column.


,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9
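
An alternative to dropping the column afterwards is to tell pandas at load time that the first CSV column is the row index. The sketch below assumes the file's first header cell is blank (which is what normally produces the name `Unnamed: 0`) and uses an inline three-row excerpt with the hypothetical names `csv_text`, `df_default`, and `df_indexed`.

```python
import io

import pandas as pd

# Inline excerpt standing in for Advertising.csv; the blank first header cell is an assumption.
csv_text = """,TV,Radio,Newspaper,Sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
"""

# Default read: the unlabeled first column becomes a regular column named 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

With this approach the explicit `drop` step becomes unnecessary, at the cost of a 1-based index taken from the file.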


## Split the data

### Subtask:
Divide the dataset into training and testing sets.


**Reasoning**:
Split the data into training and testing sets using `train_test_split` based on the features 'TV', 'Radio', and 'Newspaper' and the target 'Sales'.



In [4]:
from sklearn.model_selection import train_test_split

X = df[['TV', 'Radio', 'Newspaper']]
y = df['Sales']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (160, 3)
Shape of X_test: (40, 3)
Shape of y_train: (160,)
Shape of y_test: (40,)


## Choose and train a model

### Subtask:
Select an appropriate machine learning model for regression (since we are predicting sales) and train it on the training data.


**Reasoning**:
Import the LinearRegression model, instantiate it, and train it using the training data.



In [5]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train, y_train)

## Evaluate the model

### Subtask:
Assess the performance of the trained model using appropriate metrics on the testing data.


**Reasoning**:
Generate predictions for the test set, then compute the mean squared error and the R-squared score with `mean_squared_error` and `r2_score` from `sklearn.metrics`.



In [6]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2) Score: {r2:.2f}")

Mean Squared Error (MSE): 3.17
R-squared (R2) Score: 0.90
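
MSE is expressed in squared units of `Sales`, so it can be easier to read alongside the root mean squared error (RMSE) and the mean absolute error (MAE), which are in the same units as the target. The sketch below uses small made-up arrays (`y_true_demo`, `y_pred_demo`, both hypothetical) to show the calculation, and also converts the reported MSE of 3.17 into an RMSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration only; not taken from the Advertising data.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)   # mean of squared errors
rmse_demo = np.sqrt(mse_demo)                             # back in the target's units
mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)  # mean of absolute errors

print(f"MSE: {mse_demo:.2f}, RMSE: {rmse_demo:.2f}, MAE: {mae_demo:.2f}")

# Converting the MSE reported above (3.17) into an RMSE.
reported_rmse = np.sqrt(3.17)
print(f"RMSE for the reported MSE of 3.17: {reported_rmse:.2f}")
```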


## Make predictions

### Subtask:
Use the trained model to make predictions on held-out data; here the test set stands in for new data.


**Reasoning**:
Use the trained model to make predictions on the test set and display the first few predictions along with the actual values for comparison.



In [7]:
y_pred = model.predict(X_test)

print("First 5 Predicted Sales:")
display(y_pred[:5])

print("\nFirst 5 Actual Sales:")
display(y_test[:5])

First 5 Predicted Sales:


array([16.4080242 , 20.88988209, 21.55384318, 10.60850256, 22.11237326])


First 5 Actual Sales:


,Sales
95,16.9
15,22.4
30,21.4
158,7.3
128,24.7
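
Predicting for genuinely new spending figures works the same way, as long as the input has the same feature columns in the same order as the training data. The following is a minimal sketch with a model fitted on synthetic, noise-free data; `demo_model`, `new_spend`, and the coefficients used to generate the data are assumptions for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training data generated from a known linear rule (hypothetical coefficients).
rng = np.random.default_rng(42)
train_demo = pd.DataFrame(
    rng.uniform(0, 100, size=(60, 3)), columns=["TV", "Radio", "Newspaper"]
)
sales_demo = 3.0 + 0.05 * train_demo["TV"] + 0.2 * train_demo["Radio"]

demo_model = LinearRegression().fit(train_demo, sales_demo)

# New, unseen spending values, using the same column names and order as in training.
new_spend = pd.DataFrame({"TV": [100.0], "Radio": [20.0], "Newspaper": [10.0]})
predicted_sales = demo_model.predict(new_spend)

print("Predicted sales:", predicted_sales)
```

Passing a DataFrame with matching column names lets scikit-learn check the feature names against those seen during fitting.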


## Summary:

### Data Analysis Key Findings

*   The dataset contains 200 entries with no missing values.
*   The 'Unnamed: 0' column, identified as an index, was successfully dropped.
*   The dataset was split into training (160 samples) and testing (40 samples) sets, with a test size of 20%.
*   A Linear Regression model was trained on the training data.
*   The model achieved a Mean Squared Error (MSE) of 3.17 on the test set, which corresponds to a root mean squared error of about 1.78 in the units of `Sales`.
*   The model achieved an R-squared (R2) score of 0.90 on the test set, meaning it explains approximately 90% of the variance in sales across the 40 held-out samples.

### Insights or Next Steps

*   The linear regression model appears to fit the data well, as indicated by the test-set R-squared score of 0.90, although this result comes from a single 80/20 split.
*   Further analysis could explore the individual impact of each advertising channel (TV, Radio, Newspaper) on sales based on the model coefficients.
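
Both points above can be checked with a few lines of code. The sketch below pairs each fitted coefficient with its feature name and uses 5-fold cross-validation to see whether the R-squared score holds up across different splits. It runs on synthetic data generated from assumed coefficients (`X_demo`, `y_demo`, and `demo_model` are hypothetical names), so the numbers it prints say nothing about the Advertising dataset itself.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: sales depend on TV and Radio only, plus a little noise (assumed rule).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.uniform(0, 100, size=(200, 3)), columns=["TV", "Radio", "Newspaper"]
)
y_demo = (
    3.0 + 0.05 * X_demo["TV"] + 0.2 * X_demo["Radio"] + rng.normal(0, 0.5, size=200)
)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its feature name: the expected change in sales
# per one-unit increase in that channel's spend, holding the others fixed.
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefficients)
print("Intercept:", demo_model.intercept_)

# 5-fold cross-validated R-squared, a check that does not rely on one split.
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print("Mean cross-validated R2:", cv_scores.mean())
```

Because the three channels are measured in the same units, the coefficients can be compared directly; with features on different scales, standardizing first would make such a comparison more meaningful.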
