# Advertising Budgets-Sales Challenge

In this challenge, you'll explore a real-world dataset containing Sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media.

The advertising dataset captures the sales revenue generated with respect to advertising spend across multiple channels: radio, TV, and newspaper.

- **TV**: Spend on TV Advertisements
- **Radio**: Spend on radio Advertisements
- **Newspaper**: Spend on newspaper Advertisements
- **Sales**: Sales revenue generated


In [2]:
import pandas as pd
import matplotlib.pyplot as plt

df_advertisement = pd.read_csv('advertising.csv')


The challenge is to **explore the dataset to analyze and identify which media contribute to sales and to find a function that given input budgets for TV, radio and newspaper predicts the output sales.**

1. Start by cleaning the data.
   * Identify any null or missing data, and impute appropriate replacement values.
   * Describe and identify statistical parameters for each column.

2. Determine the relationship between the advertising budgets and sales, and build a predictive model that estimates sales from the given budgets for TV, radio, and newspaper.
   **Exploratory Data Analysis (EDA)**: 
   * Describe and visualize the data to understand the distribution and relationships between variables
   * Calculate and plot heatmap correlation and pairwise correlations
   **Feature Engineering**: 
   * Create any additional features that might help in the analysis. This is explained later.
   * Split the data into training and test datasets.
   * Train_Set_Size needs to be 90% and Test_Set_Size 10%.
3. Apply any machine learning algorithm to the dataset
   * Load the algorithm
   * Instantiate and fit the model on the training dataset
   * Predict on the test set
   * Evaluate with 3 different metrics.
4. Create a new feature called Area, and randomly assign observations to be rural, suburban, or urban; this variable needs to have a Gaussian distribution.
   * Plot the new data distribution according to the new feature.
   * Transform feature to numerical. Create additional dummy binary variables that describe the feature:
      - rural is coded as Area_suburban = 0 and Area_urban = 0
      - suburban is coded as Area_suburban = 1 and Area_urban = 0
      - urban is coded as Area_suburban = 0 and Area_urban = 1
   * Apply step 3 again with this dataset.
5. Answer the following questions:
- Is there a relationship between sales and spend on the various advertising channels?
- Which channel has the strongest relationship with sales?
- Which model describes both problems?
- Which is the best channel for increasing sales?
- Which is the worst channel for increasing sales?
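
As a rough illustration of steps 2 through 4, the sketch below runs the whole pipeline on synthetic data, not on `advertising.csv`. Several choices here are assumptions rather than part of the challenge: the synthetic coefficients, ordinary least squares (via `np.linalg.lstsq`) as the "any algorithm" of step 3, MAE, RMSE, and R² as the three metrics, and reading the "Gaussian distribution" requirement as binning standard-normal draws into three areas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the advertising data (NOT the real dataset).
df = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "radio": rng.uniform(0, 50, n),
    "newspaper": rng.uniform(0, 115, n),
})
# Assumed relationship, chosen only so the example has a signal to recover.
df["sales"] = 3 + 0.05 * df["TV"] + 0.2 * df["radio"] + rng.normal(0, 1, n)

# Step 4 (one interpretation): bin standard-normal draws into three areas,
# so the middle category (suburban) is the most frequent.
z = rng.normal(0, 1, n)
df["Area"] = pd.cut(z, bins=[-np.inf, -0.5, 0.5, np.inf],
                    labels=["rural", "suburban", "urban"])

# drop_first=True drops the first category (rural), which becomes the
# baseline coded as Area_suburban = 0 and Area_urban = 0.
df = pd.get_dummies(df, columns=["Area"], drop_first=True, dtype=int)

# Step 2: shuffle, then split 90% / 10%.
shuffled = df.sample(frac=1.0, random_state=0)
n_train = int(round(0.9 * n))
train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]

features = ["TV", "radio", "newspaper", "Area_suburban", "Area_urban"]

def design(frame):
    """Feature matrix with a leading column of ones for the intercept."""
    X = frame[features].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(X)), X])

# Step 3: fit by least squares on the training set, predict on the test set.
coef, *_ = np.linalg.lstsq(design(train), train["sales"].to_numpy(), rcond=None)
pred = design(test) @ coef

# Three evaluation metrics on the held-out 10%.
y = test["sales"].to_numpy()
mae = np.mean(np.abs(y - pred))
rmse = np.sqrt(np.mean((y - pred) ** 2))
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Because `Area` is assigned at random, its two dummy coefficients should come out close to zero; comparing the metrics with and without those columns is one way to approach the "both problems" question.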



1.- What do we have here?




In [3]:
df_advertisement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  200 non-null    int64  
 1   TV          200 non-null    float64
 2   radio       200 non-null    float64
 3   newspaper   200 non-null    float64
 4   sales       200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB


In [4]:
# pd.set_option("max_rows", None)
pd.set_option('display.max_rows', None)  # show all rows
df_advertisement


Unnamed: 0.1,Unnamed: 0,TV,radio,newspaper,sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9
5,6,8.7,48.9,75.0,7.2
6,7,57.5,32.8,23.5,11.8
7,8,120.2,19.6,11.6,13.2
8,9,8.6,2.1,1.0,4.8
9,10,199.8,2.6,21.2,10.6


In [5]:
null_values = df_advertisement.isnull().sum()
null_values

Unnamed: 0    0
TV            0
radio         0
newspaper     0
sales         0
dtype: int64

- We are working with one ID column and float-type data
- Looking at the CSV, there appear to be two ID columns, `Unnamed: 0` and the index that pandas generates automatically, so we can drop one
- Checking for null values, none exist in any column
- Looking at the dataframe, no unusual number stands out
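
No imputation is needed here because every column reports zero nulls. If a column did contain missing values, one common option (an assumption, not something this dataset requires) is to fill them with the column median. A minimal sketch on a toy frame with invented gaps:

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (not the real dataset).
toy = pd.DataFrame({
    "TV": [230.1, np.nan, 17.2],
    "radio": [37.8, 39.3, np.nan],
})

# Count missing values per column, as in the cell above.
missing_per_column = toy.isnull().sum()

# median() skips NaN by default; fillna() with a Series fills column by column.
filled = toy.fillna(toy.median())

print(missing_per_column)
print(filled)
```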

Let's analyze the maximum and minimum values of the columns.


In [6]:
df_advertisement.max()

Unnamed: 0    200.0
TV            296.4
radio          49.6
newspaper     114.0
sales          27.0
dtype: float64

In [18]:
df_advertisement.min()

Unnamed: 0    1.0
TV            0.7
radio         0.3
newspaper     0.3
sales         1.6
dtype: float64

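We find an outlier of 0.0 in the radio column; we can remove that row. The outputs shown here appear to have been captured after that row was dropped, which is why the minimum reads 0.3 and the lookup below returns no rows.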

In [17]:
df_advertisement.loc[df_advertisement['radio'] == 0.0]


Unnamed: 0.1,Unnamed: 0,TV,radio,newspaper,sales


In [21]:
# the row is dropped with df_advertisement.drop([127])
# and we check the values again
df_advertisement.min()


Unnamed: 0    1.0
TV            0.7
radio         0.3
newspaper     0.3
sales         1.6
dtype: float64
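
The lookup and removal above can be sketched on a small toy frame. The values are invented for illustration, and the names `zero_radio` and `cleaned` are hypothetical. Dropping by the index of the matching rows avoids hard-coding a row label.

```python
import pandas as pd

# Toy frame with one zero radio budget (invented values).
toy = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2],
    "radio": [37.8, 0.0, 45.9],
})

# Boolean mask: select the rows where radio is exactly 0.0.
zero_radio = toy.loc[toy["radio"] == 0.0]

# drop() returns a new DataFrame, so the result has to be assigned.
cleaned = toy.drop(index=zero_radio.index)

print(cleaned.min())
```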

We will drop the extra ID column.


In [22]:
df_advertisement.drop(columns=["Unnamed: 0"])

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9
5,8.7,48.9,75.0,7.2
6,57.5,32.8,23.5,11.8
7,120.2,19.6,11.6,13.2
8,8.6,2.1,1.0,4.8
9,199.8,2.6,21.2,10.6
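
Note that `drop` returns a new DataFrame and leaves `df_advertisement` unchanged unless the result is assigned, which is why `Unnamed: 0` still appears in later outputs. The sketch below shows two ways to make the change stick, using a two-row toy CSV. It assumes the column comes from an unnamed first column in the file, which is how pandas produces the name `Unnamed: 0`.

```python
import io
import pandas as pd

# Toy CSV whose first header cell is empty (assumed layout of the real file).
csv_text = (
    ",TV,radio,newspaper,sales\n"
    "1,230.1,37.8,69.2,22.1\n"
    "2,44.5,39.3,45.1,10.4\n"
)

raw = pd.read_csv(io.StringIO(csv_text))  # first column is named "Unnamed: 0"

# Option 1: assign the result of drop() to keep the change.
dropped = raw.drop(columns=["Unnamed: 0"])

# Option 2: use the first column as the index when reading the file.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(dropped.columns))
print(list(indexed.columns))
```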


The data looks clean.
We will use the `.describe` method to obtain statistical parameters:
- Mean (mean)
- Standard deviation (std)
- Minimum value (min)
- Maximum value (max)
- Quartiles (25%, 50%, and 75%)



In [23]:
df_advertisement.describe()

Unnamed: 0.1,Unnamed: 0,TV,radio,newspaper,sales
count,199.0,199.0,199.0,199.0,199.0
mean,100.361809,147.378392,23.380905,30.661307,14.048744
std,57.992073,85.938922,14.791683,21.780479,5.217365
min,1.0,0.7,0.3,0.3,1.6
25%,50.5,74.05,10.05,12.85,10.4
50%,100.0,149.8,23.3,25.9,12.9
75%,150.5,219.15,36.55,45.1,17.4
max,200.0,296.4,49.6,114.0,27.0
