# Predictive Modeling of Television Subscription Rates Using Regression Analysis
## Students:
* Vremăroiu Andrei Florin
* Tudor Alexandru Panait

## Objectives
**Data Analysis:** The project will analyze historical data from television transmitters, stations, inflation rates, and subscription figures.

**Model Development:** We aim to develop regression models to predict television subscription rates using variables like transmitter and station numbers, and inflation rates.

**Model Evaluation:** The models' effectiveness will be assessed using metrics like Mean Squared Error (MSE) and Mean Absolute Error (MAE).

**Hyperparameter Optimization:** This step will fine-tune the Random Forest Regressor's hyperparameters to enhance prediction accuracy.

**Business Insights:** The project seeks to uncover insights into the factors affecting television subscription rates and their business implications in the telecom industry.

> *Please run all cells and upload the dataset from the folder. Thank you!*



In [None]:
from google.colab import files
uploaded = files.upload()

Saving translatoareteleviziune.csv to translatoareteleviziune.csv
Saving statiiteleviziune.csv to statiiteleviziune.csv
Saving ratainflatiei.csv to ratainflatiei.csv
Saving program.py to program.py
Saving abonamenteteleviziune.csv to abonamenteteleviziune.csv


## Importing Necessary Libraries

In this section, we import several Python libraries that are essential for data handling, model building, and performance evaluation:

- **pandas**: Used for data manipulation and analysis. It provides data structures and operations for manipulating numerical tables and time series, which is fundamental for handling our datasets.

- **scikit-learn (sklearn)**: This library covers most stages of the machine learning pipeline:
  - `train_test_split`: Splits data into training and testing sets. It is imported here, although the split later in this notebook is done by slicing the year-sorted data so that the test set contains the most recent years.
  - `LinearRegression`: The linear regression model, which we use as the baseline for forecasting television subscription rates.
  - `mean_squared_error`, `mean_absolute_error`: Functions that compute the Mean Squared Error (MSE) and Mean Absolute Error (MAE), the two metrics we use to evaluate prediction accuracy.
  - `RandomForestRegressor`: An ensemble learning method based on randomized decision trees. We use it to build a more complex model that can potentially capture nonlinear dependencies in the data.
  - `GradientBoostingRegressor`: An ensemble that adds decision trees sequentially, each one fitted to the errors of the current ensemble.
  - `GridSearchCV`: A tool for hyperparameter tuning that evaluates every combination in a parameter grid with cross-validation and keeps the best-scoring one.

- **xgboost**: Provides `XGBRegressor`, an optimized implementation of gradient boosted trees.

- **matplotlib**: Used to plot the bar charts that compare the models.

These libraries and their specific modules provide the tools needed to carry out each phase of our project, from data preparation to complex model training and evaluation.


In [None]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

In [None]:
# Loading datasets
translatoare_df = pd.read_csv('translatoareteleviziune.csv')
statii_df = pd.read_csv('statiiteleviziune.csv')
inflatie_df = pd.read_csv('ratainflatiei.csv')
abonamente_df = pd.read_csv('abonamenteteleviziune.csv')

## Data Preparation

### Cleaning Column Names
To ensure consistency and prevent errors during data manipulation, we first clean the column names across all datasets:
- **Stripping whitespace**: We remove any leading or trailing spaces from the column names using `str.strip()` method. This is crucial to avoid errors in referencing column names that might inadvertently include spaces.
- **Column selection**: For the `translatoare_df`, we only retain the first two columns as they contain the relevant data needed for our analysis.
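
The two cleaning steps above can be illustrated on a small stand-in table. This is a minimal sketch: the frame `raw_demo` and the column names `Valoare` and `Note` are hypothetical and are not taken from the project's CSV files.

```python
import pandas as pd

# Hypothetical frame whose headers carry stray whitespace, as CSV exports sometimes do.
raw_demo = pd.DataFrame({" Anul ": [2019, 2020], "Valoare ": [10, 12], "Note": ["a", "b"]})

# Strip leading and trailing spaces from every column name.
raw_demo.columns = raw_demo.columns.str.strip()

# Keep only the first two columns, selected by position.
trimmed_demo = raw_demo.iloc[:, :2]

print(list(trimmed_demo.columns))
```

Selecting by position with `iloc` works regardless of how the columns are named, but it silently depends on the column order in the file staying the same.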

### Merging Datasets
To consolidate our data for analysis, we merge the datasets on the `Anul` column, which represents the year:

- **Merging strategy**: We use an inner join to ensure that we only keep records that have data across all datasets for the same year.
- **Handling suffixes**: To differentiate columns with the same names but from different datasets, we add suffixes to the column names (`_translatoare`, `_statii`, `_inflatie`, `_abonamente`).
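
As a rough illustration of the merging strategy, the sketch below joins two toy tables that share a non-key column name. The frames `left_demo` and `right_demo` and their values are invented for the example.

```python
import pandas as pd

# Two toy tables with overlapping years and a clashing column name ("Valoare").
left_demo = pd.DataFrame({"Anul": [2018, 2019, 2020], "Valoare": [5, 6, 7]})
right_demo = pd.DataFrame({"Anul": [2019, 2020, 2021], "Valoare": [50, 60, 70]})

# Inner join: only years present in both tables survive.
# Suffixes are applied only to non-key columns whose names clash.
merged_demo = pd.merge(
    left_demo, right_demo, on="Anul", how="inner",
    suffixes=("_translatoare", "_statii"),
)

print(merged_demo)
```

Note that `suffixes` has no effect on columns whose names are already unique, and when it is not passed pandas falls back to its defaults `_x` and `_y` for any clashing names.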

### Sorting Data
To maintain the temporal order, which is crucial for any time series analysis or any study where trends over time are relevant, we sort the data by the `Anul` column.

### Splitting Data into Training and Testing Sets
To evaluate the performance of our predictive models, we split the year-sorted dataset into training and testing sets. The split is chronological rather than random, so the models are always tested on years later than those they were trained on:

- **Training set**: The earliest 80% of the rows, used to train the models.
- **Testing set**: The most recent 20% of the rows, used to test predictive performance.
- **Feature and target separation**: We separate the features (X) from the target variable (y), which in this case is `Rata valoare`, representing the subscription rate or value. Every other column, including `Anul`, is kept as a feature.
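
The split described above can be sketched on a toy table. The frame `toy_df` and its `feature` column are made up for illustration; only the column names `Anul` and `Rata valoare` come from the notebook.

```python
import pandas as pd

# Toy data with deliberately shuffled years.
toy_df = pd.DataFrame({
    "Anul": [2014, 2011, 2013, 2012, 2015, 2016, 2017, 2018, 2019, 2020],
    "feature": list(range(10)),
    "Rata valoare": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

# Sort by year first so that positional slicing is also chronological.
toy_df = toy_df.sort_values(by="Anul").reset_index(drop=True)

# The first 80% of the rows form the training set, the rest the test set.
cut = int(0.8 * len(toy_df))
toy_train, toy_test = toy_df.iloc[:cut], toy_df.iloc[cut:]

# Separate features from the target.
toy_X_train = toy_train.drop("Rata valoare", axis=1)
toy_y_train = toy_train["Rata valoare"]
toy_X_test = toy_test.drop("Rata valoare", axis=1)
toy_y_test = toy_test["Rata valoare"]

print(len(toy_X_train), len(toy_X_test))
```

Slicing after sorting guarantees that every test year is later than every training year, which a shuffled split would not.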

In [None]:
# Cleaning column names
translatoare_df.columns = translatoare_df.columns.str.strip()
translatoare_df = translatoare_df.iloc[:, :2]
inflatie_df.columns = inflatie_df.columns.str.strip()
abonamente_df.columns = abonamente_df.columns.str.strip()
statii_df.columns = statii_df.columns.str.strip()

# Merging datasets based on 'Anul' column
merged_df = pd.merge(translatoare_df, statii_df, on='Anul', suffixes=('_translatoare', '_statii'))
merged_df = pd.merge(merged_df, inflatie_df, on='Anul')
merged_df = pd.merge(merged_df, abonamente_df, on='Anul', suffixes=('_inflatie', '_abonamente'))

# Sorting DataFrame by 'Anul' column to maintain temporal order
merged_df = merged_df.sort_values(by='Anul')

# Splitting the data into training and testing sets
train_index = int(0.8 * len(merged_df))
X_train = merged_df.iloc[:train_index].drop('Rata valoare', axis=1)
y_train = merged_df.iloc[:train_index]['Rata valoare']
X_test = merged_df.iloc[train_index:].drop('Rata valoare', axis=1)
y_test = merged_df.iloc[train_index:]['Rata valoare']

## Model Development and Evaluation

### Training the Linear Regression Model
In this step, we employ the Linear Regression algorithm to develop our predictive model. This model will help us understand the relationship between the input features and the target variable, which in our case is the subscription rate.
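
One way to read the relationship between features and target from a fitted linear model is through its coefficients. The sketch below uses synthetic, noise-free data (the arrays `X_demo` and `y_demo` are assumptions for the example, not project data), so the fitted coefficients should match the ones used to generate the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known linear rule: y = 2*x1 - 3*x2 + 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 1.0

# Fit ordinary least squares and inspect the learned parameters.
demo_lr = LinearRegression().fit(X_demo, y_demo)

print("coefficients:", demo_lr.coef_)
print("intercept:", demo_lr.intercept_)
```

Each coefficient is the change in the prediction for a one-unit change in that feature with the others held fixed. On real data with correlated features the coefficients are harder to interpret individually.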

### Predicting and Evaluating the Model
After training the model, we use it to predict subscription rates on the testing set. To assess the accuracy of our model's predictions, we calculate the Mean Squared Error (MSE) and Mean Absolute Error (MAE) between the predicted values and the actual values:

- **Mean Squared Error (MSE)**: The average of the squared differences between the predicted and actual values. Squaring makes large errors count disproportionately, and a lower MSE indicates a better fit of the model to the data.
- **Mean Absolute Error (MAE)**: The average of the absolute differences between the predicted and actual values over the test sample. Every error is weighted equally regardless of its direction, and the result is expressed in the same units as the target.

These evaluation metrics provide us with insights into the model's performance, indicating how well our model can forecast television subscription rates based on the given features. By examining these errors, we can gauge the accuracy and reliability of our predictive model.
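
To make the two definitions concrete, the following sketch computes both metrics by hand on four made-up values and checks them against the scikit-learn functions. The arrays `actual` and `predicted` are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Four made-up observations and predictions.
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

# Errors are [-0.5, 0.0, 2.0, 1.0].
errors = predicted - actual

# MSE: mean of squared errors = (0.25 + 0 + 4 + 1) / 4.
mse_manual = np.mean(errors ** 2)

# MAE: mean of absolute errors = (0.5 + 0 + 2 + 1) / 4.
mae_manual = np.mean(np.abs(errors))

print("MSE:", mse_manual, "sklearn:", mean_squared_error(actual, predicted))
print("MAE:", mae_manual, "sklearn:", mean_absolute_error(actual, predicted))
```

The single error of 2.0 contributes 4 of the 5.25 total squared error, which shows how MSE is dominated by the largest misses while MAE is not.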

In [None]:
# Training a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Mean Absolute Error:", mae)

Mean Squared Error: 0.1336082006830817
Mean Absolute Error: 0.27631422442122666


## Advanced Model Development and Evaluation

### Training the Random Forest Regressor with Hyperparameter Optimization
To try to improve on the baseline, we train a Random Forest Regressor, an ensemble learning method that can capture nonlinear relationships in the data. To tune the model, we use GridSearchCV to evaluate every combination in a grid of hyperparameters with 3-fold cross-validation, keeping the combination that scores best.

### Evaluating the Optimized Random Forest Model
After determining the best hyperparameters, we use the optimized Random Forest model to make predictions on the test dataset. We then evaluate the model's accuracy using the same metrics as before, Mean Squared Error (MSE) and Mean Absolute Error (MAE), which allows a direct comparison with the simpler Linear Regression model. In the run recorded in this notebook, the Random Forest scores an MSE of about 3.52 and an MAE of about 1.78, noticeably worse than Linear Regression (MSE of about 0.13, MAE of about 0.28). One plausible reason is that a Random Forest predicts by averaging target values seen during training, so it cannot extrapolate to the later years held out by the chronological split.

This section therefore demonstrates the implementation of a more complex model and shows that hyperparameter tuning alone does not guarantee better results than a simple baseline.
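
Because the data is ordered by year, one possible variation is to let the grid search validate on later observations only and to score candidates by MSE directly. The sketch below is an alternative setup, not the configuration used in this notebook: the synthetic arrays `X_ts` and `y_ts`, the reduced grid `small_grid`, and the choice of `TimeSeriesSplit` with `scoring="neg_mean_squared_error"` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic time-ordered data: a noisy upward trend over 40 steps.
rng = np.random.default_rng(0)
X_ts = np.arange(40, dtype=float).reshape(-1, 1)
y_ts = 0.5 * X_ts[:, 0] + rng.normal(scale=0.1, size=40)

# A deliberately small grid so the search finishes quickly.
small_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}

search_demo = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=small_grid,
    cv=TimeSeriesSplit(n_splits=3),       # each fold validates on later rows only
    scoring="neg_mean_squared_error",     # higher is better, so MSE is negated
)
search_demo.fit(X_ts, y_ts)

print("Best hyperparameters:", search_demo.best_params_)
print("Cross-validated MSE of best candidate:", -search_demo.best_score_)
```

When `scoring` is left at its default, `GridSearchCV` ranks regressors by their `score` method, which is R² rather than MSE, so the candidate it selects is not necessarily the one with the lowest cross-validated MSE.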

In [None]:
# Training a Random Forest Regressor model with hyperparameter optimization
rf_model = RandomForestRegressor()
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
print("Best Hyperparameters:", grid_search.best_params_)
y_pred_rf = grid_search.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
print("Mean Squared Error (Random Forest):", mse_rf)
print("Mean Absolute Error (Random Forest):", mae_rf)

Fitting 3 folds for each of 81 candidates, totalling 243 fits
Best Hyperparameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Mean Squared Error (Random Forest): 3.5191628509339474
Mean Absolute Error (Random Forest): 1.7783228870870849


## Gradient Boosting Model Training and Evaluation

### Training the Gradient Boosting Model
Gradient Boosting is a widely used machine learning technique that builds an ensemble of decision trees sequentially, with each new tree fitted to correct the errors of the trees before it. Here, we train a Gradient Boosting Regressor, which minimizes squared-error loss by default. The model parameters include:
- `n_estimators=100`: The number of boosting stages to be run. More stages can lead to better performance but also to overfitting.
- `max_depth=5`: The maximum depth of the individual regression estimators. This controls the complexity and performance of the model.

### Evaluating the Gradient Boosting Model
After training, we predict the television subscription rates using our test set and evaluate the model's performance using Mean Squared Error (MSE) and Mean Absolute Error (MAE) to measure accuracy.
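
The trade-off between the number of boosting stages and overfitting can be inspected with `staged_predict`, which yields the ensemble's predictions after each stage. The sketch below runs on synthetic data (the arrays `X_gb` and `y_gb` and the 150/50 split are assumptions for the example) and records the held-out MSE at every stage.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic nonlinear data: a noisy sine curve.
rng = np.random.default_rng(0)
X_gb = rng.uniform(-3, 3, size=(200, 1))
y_gb = np.sin(X_gb[:, 0]) + rng.normal(scale=0.3, size=200)

# Hold out the last 50 rows for evaluation.
X_fit, X_hold = X_gb[:150], X_gb[150:]
y_fit, y_hold = y_gb[:150], y_gb[150:]

# Same settings as in the notebook, plus a fixed seed for repeatability.
gb_demo = GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=0)
gb_demo.fit(X_fit, y_fit)

# Held-out MSE after each boosting stage (1..100).
stage_mse = [mean_squared_error(y_hold, pred) for pred in gb_demo.staged_predict(X_hold)]
best_stage = int(np.argmin(stage_mse)) + 1

print("Stage with lowest held-out MSE:", best_stage)
print("MSE at that stage:", stage_mse[best_stage - 1])
print("MSE after all stages:", stage_mse[-1])
```

If the lowest held-out error occurs well before the final stage, the later trees are mostly fitting noise, and a smaller `n_estimators` or shallower `max_depth` may generalize better.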


In [1]:
# Training a Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=100, max_depth=5)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
mse_gb = mean_squared_error(y_test, y_pred_gb)
mae_gb = mean_absolute_error(y_test, y_pred_gb)

# Output MSE and MAE for Gradient Boosting model
print("Gradient Boosting Model Performance:")
print("Mean Squared Error (MSE):", mse_gb)
print("Mean Absolute Error (MAE):", mae_gb)


## Training the XGBoost Model
XGBoost (Extreme Gradient Boosting) is an optimized gradient boosting library designed to be efficient, flexible, and portable. It implements gradient boosted decision trees with parallelized tree construction, which makes training fast on many tabular problems. The key parameters are:

- `objective='reg:squarederror'`: Specifies the learning task and the corresponding learning objective, here regression with squared-error loss.
- `n_estimators=100`: Number of gradient boosted trees. Equivalent to the number of boosting rounds.
- `learning_rate=0.1`: Boosting learning rate (xgb's "eta").
- `max_depth=5`: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.

### Evaluating the XGBoost Model
Similar to the Gradient Boosting model, we evaluate the XGBoost model's performance on the test data using MSE and MAE to understand its accuracy in predicting subscription rates.

In [None]:
# Training an XGBoost model
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, learning_rate=0.1, max_depth=5)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)

# Output MSE and MAE for XGBoost model
print("XGBoost Model Performance:")
print("Mean Squared Error (MSE):", mse_xgb)
print("Mean Absolute Error (MAE):", mae_xgb)

## Model Performance Comparison and Visualization

### Storing Performance Metrics
In this section, we consolidate the performance metrics of all the models we have trained into a single DataFrame. This structure allows us to efficiently compare the Mean Squared Error (MSE) and Mean Absolute Error (MAE) across the following models:
- Linear Regression
- Random Forest
- Gradient Boosting
- XGBoost

These metrics are the basis for comparing the models. MSE averages the squared errors and therefore penalizes large errors more heavily, while MAE averages the absolute errors and is expressed in the same units as the target.

### Visualizing Model Comparison
#### Mean Squared Error Comparison
We create a bar chart to visually compare the MSE of each model. This graph highlights the model's performance in terms of error minimization, where a lower MSE value suggests a model with better predictive accuracy, indicating fewer and smaller errors in predictions.

#### Mean Absolute Error Comparison
Similarly, we plot the MAE for each model using a bar chart. This measure helps us understand which model predicts more closely to the actual values on average, with a lower MAE indicating a more accurate and consistent model.

### Interpretation of Results
These visualizations give a clear and immediate comparison of model performance. Reading MSE and MAE together is useful because MSE penalizes large errors more heavily: a model that ranks well on MAE but poorly on MSE is usually accurate on most test points but makes a few large errors. This comparison supports an informed decision about which model to use for predicting television subscription rates on the given dataset.
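
As a numeric complement to the charts, the models can also be ranked in a small table. The sketch below uses only the metric values printed earlier in this notebook for Linear Regression and Random Forest. The frame name `reported` and the added `RMSE` column are choices made for this example, and the two boosting models are left out because no metric values are printed for them.

```python
import pandas as pd

# Values copied from the printed outputs of the Linear Regression
# and Random Forest cells earlier in the notebook.
reported = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest"],
    "MSE": [0.1336082006830817, 3.5191628509339474],
    "MAE": [0.27631422442122666, 1.7783228870870849],
})

# Rank from lowest to highest MSE.
ranked = reported.sort_values("MSE").reset_index(drop=True)

# RMSE is the square root of MSE, in the same units as the target and as MAE.
ranked["RMSE"] = ranked["MSE"] ** 0.5

print(ranked)
```

Adding RMSE puts the squared-error metric on the same scale as MAE, which makes the gap between the two metrics for a given model easier to read.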

In [None]:
# Storing performance metrics in a DataFrame
model_performance = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest", "Gradient Boosting", "XGBoost"],
    "MSE": [mse, mse_rf, mse_gb, mse_xgb],
    "MAE": [mae, mae_rf, mae_gb, mae_xgb]
})

# Plotting MSE Comparison
plt.figure(figsize=(10, 5))
plt.bar(model_performance['Model'], model_performance['MSE'], color='blue')
plt.title('Comparison of Models by MSE')
plt.ylabel('Mean Squared Error')
plt.show()

# Plotting MAE Comparison
plt.figure(figsize=(10, 5))
plt.bar(model_performance['Model'], model_performance['MAE'], color='green')
plt.title('Comparison of Models by MAE')
plt.ylabel('Mean Absolute Error')
plt.show()