**Understand the Dataset**
The dataset contains the following columns:

    Date: The date of the recorded data.
    Usage: The amount of time spent on the app (presumably in minutes).
    Notifications: The number of notifications received from the app.
    Times opened: The number of times the app was opened.
    App: The name of the app (in this case, Instagram or WhatsApp).

# **Exploratory Data Analysis (EDA)**

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

In [None]:
# Load the dataset
df = pd.read_csv("/content/dataset/Screentime-App-Details-Dataset.csv")

In [None]:
df.head()

,Date,Usage,Notifications,Times opened,App
0,2022-08-26,38,70,49,Instagram
1,2022-08-27,39,43,48,Instagram
2,2022-08-28,64,231,55,Instagram
3,2022-08-29,14,35,23,Instagram
4,2022-08-30,3,19,5,Instagram


In [None]:
# Check for missing values
df.isnull().sum()

,0
Date,0
Usage,0
Notifications,0
Times opened,0
App,0


In [None]:
# Convert the 'Date' column to a datetime format for easier manipulation.
df['Date'] = pd.to_datetime(df['Date'])

Descriptive Statistics:

In [None]:
# Calculate summary statistics (count, mean, min, quartiles including the median, max, standard deviation) for the numeric and date columns.
df.describe()

,Date,Usage,Notifications,Times opened
count,54,54.0,54.0,54.0
mean,2022-09-08 00:00:00,65.037037,117.703704,61.481481
min,2022-08-26 00:00:00,1.0,8.0,2.0
25%,2022-09-01 06:00:00,17.5,25.75,23.5
50%,2022-09-08 00:00:00,58.5,99.0,62.5
75%,2022-09-14 18:00:00,90.5,188.25,90.0
max,2022-09-21 00:00:00,244.0,405.0,192.0
std,,58.317272,97.01753,43.836635
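
`describe()` reports the median as the 50% row but does not report the mode. A minimal sketch of computing both explicitly, using only the five rows shown by `df.head()` above (the `sample` frame is rebuilt by hand so the snippet runs on its own; on the full dataset the same calls would be applied to `df`):

```python
import pandas as pd

# The five rows printed by df.head() above, rebuilt here for a standalone example.
sample = pd.DataFrame({
    "Usage": [38, 39, 64, 14, 3],
    "Notifications": [70, 43, 231, 35, 19],
    "Times opened": [49, 48, 55, 23, 5],
})

numeric_cols = ["Usage", "Notifications", "Times opened"]

# Median of each column (the value describe() shows in its 50% row).
medians = sample[numeric_cols].median()

# mode() returns a DataFrame because a column can have several modes;
# the first row holds the smallest mode of each column.
modes = sample[numeric_cols].mode().iloc[0]

print(pd.DataFrame({"median": medians, "mode": modes}))
```

When every value in a column is unique, as in this tiny sample, each value counts as a mode, so the mode is not very informative for continuous measurements such as usage minutes.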


In [None]:
# Count the number of rows recorded for each app.
df['App'].value_counts()

App,count
Instagram,27
Whatsapp,27


Visualizations:

In [None]:
# Time series plot of usage over time.
fig = px.line(df, x='Date', y='Usage', color='App', title='App Usage Over Time',
              labels={'Usage': 'Minutes', 'Date': 'Date'})
fig.update_layout(xaxis_title='Date', yaxis_title='Usage (Minutes)')
fig.show()

Insight from the app usage plot:

    WhatsApp has higher overall usage than Instagram, which suggests it is the user's more frequently used communication platform.

In [None]:
# Bar chart showing the average usage per app.
avg_usage = df.groupby('App')['Usage'].mean().reset_index()
fig = px.bar(avg_usage, x='App', y='Usage', title='Average Usage Per App',
             labels={'Usage': 'Average Usage (Minutes)', 'App': 'App'})
fig.update_layout(xaxis_title='App', yaxis_title='Average Usage (Minutes)')
fig.show()

Insight from the Bar Chart:

    The chart shows that WhatsApp has a considerably higher average usage time than Instagram.
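
A visual comparison of bar heights can be backed by the numbers behind the bars. The sketch below uses a hypothetical `toy` frame (invented values, not the real dataset) to show how the per-app mean, median, and total could be tabulated, along with the ratio of the two means:

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the dataset; not the real data.
toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Whatsapp", "Whatsapp", "Whatsapp"],
    "Usage": [30, 40, 50, 60, 90, 120],
})

# Mean, median, and total usage per app.
per_app = toy.groupby("App")["Usage"].agg(["mean", "median", "sum"])

# How many times larger the WhatsApp mean is than the Instagram mean.
ratio = per_app.loc["Whatsapp", "mean"] / per_app.loc["Instagram", "mean"]

print(per_app)
print(f"WhatsApp / Instagram mean usage ratio: {ratio:.2f}")
```

Running the same aggregation on `df` would give the exact averages plotted in the bar chart.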

In [None]:
# Scatter plot to visualize the relationship between notifications and usage.
fig = px.scatter(df, x='Notifications', y='Usage', color='App',
                 title='Notifications vs Usage',
                 labels={'Notifications': 'Number of Notifications', 'Usage': 'Usage (Minutes)'})
fig.update_traces(marker=dict(size=10, opacity=0.7))
fig.show()

Insight from the Scatter Plot:

    WhatsApp and Notifications: For WhatsApp, there seems to be a positive correlation between the number of notifications and usage time. As the number of notifications increases, the usage time tends to increase as well. This suggests that notifications might be a significant driver of WhatsApp usage.

    Instagram and Notifications: For Instagram, the relationship is less clear-cut. There doesn't appear to be a strong positive or negative correlation. The data points are more scattered, indicating that notifications might not have a consistent impact on Instagram usage.
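
The per-app impressions from the scatter plot can be checked numerically by computing the Pearson correlation separately for each app. This is a sketch on a hypothetical `toy` frame (invented values chosen to mimic one scattered and one roughly linear group), not on the real data:

```python
import pandas as pd

# Hypothetical toy data: Instagram points are scattered, WhatsApp points rise together.
toy = pd.DataFrame({
    "App": ["Instagram"] * 4 + ["Whatsapp"] * 4,
    "Notifications": [10, 20, 30, 40, 50, 100, 150, 200],
    "Usage": [25, 10, 30, 15, 40, 70, 110, 160],
})

# Pearson correlation between notifications and usage, computed within each app.
per_app_corr = {
    app: group["Notifications"].corr(group["Usage"])
    for app, group in toy.groupby("App")
}

for app, r in per_app_corr.items():
    print(f"{app}: r = {r:.2f}")
```

Applying the same loop to `df` would show whether the correlation really differs between the two apps or only appears to in the plot.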

In [None]:
# Correlation heatmap of the numeric features
correlation_matrix = df[['Usage', 'Notifications', 'Times opened']].corr()
# corr() already returns a square matrix, so it can be passed to imshow directly
fig = px.imshow(correlation_matrix,
                title='Correlation Heatmap',
                labels=dict(x='Features', y='Features', color='Correlation'))
fig.update_layout(xaxis_title='Features', yaxis_title='Features')
fig.show()

Insights From The Correlation Heatmap:

    Strong Positive Correlations:
        Usage vs. Notifications: "Usage" and "Notifications" are strongly positively correlated. Days with more notifications tend to have longer usage time.
        Usage vs. Times Opened: Similarly, "Usage" and "Times Opened" are strongly positively correlated. The more times the app is opened, the higher the usage time tends to be.
        Notifications vs. Times Opened: "Notifications" and "Times Opened" are also strongly positively correlated. Days with more notifications tend to have more app opens, although correlation alone does not show that the notifications cause the opens.

    Feature Relationships:
        All three features ("Usage," "Notifications," and "Times Opened") are strongly positively correlated with one another. Because "Notifications" and "Times Opened" move together, they carry overlapping information when both are used as predictors.

# **Model Building with Scikit-Learn**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

Feature Selection:

In [None]:
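# Select the predictor features and the target
X = df[['Notifications', 'Times opened']]
y = df['Usage']

# Standardize features.
# Note: the scaler is fit on all rows before the split, so test-set statistics
# leak into the scaling. Fitting it on the training rows only would avoid this.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)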

# Linear Regression Model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
lin_reg_predictions = lin_reg.predict(X_test)

# Ridge Regression (Regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
ridge_predictions = ridge.predict(X_test)

# Lasso Regression (Regularization)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
lasso_predictions = lasso.predict(X_test)

# Random Forest Regressor with Hyperparameter Tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}
rf_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(rf_reg, param_grid, cv=3, scoring='r2')
grid_search.fit(X_train, y_train)
best_rf = grid_search.best_estimator_
rf_predictions = best_rf.predict(X_test)

# Model Evaluation
def evaluate_model(name, y_true, y_pred):
    print(f"\n{name} Performance:")
    print(f"Mean Absolute Error (MAE): {mean_absolute_error(y_true, y_pred):.2f}")
    print(f"Mean Squared Error (MSE): {mean_squared_error(y_true, y_pred):.2f}")
    print(f"R-squared: {r2_score(y_true, y_pred):.2f}")

# Evaluate Linear Regression
evaluate_model("Linear Regression", y_test, lin_reg_predictions)

# Evaluate Ridge Regression
evaluate_model("Ridge Regression", y_test, ridge_predictions)

# Evaluate Lasso Regression
evaluate_model("Lasso Regression", y_test, lasso_predictions)

# Evaluate Random Forest Regressor
evaluate_model("Random Forest Regressor", y_test, rf_predictions)



Linear Regression Performance:
Mean Absolute Error (MAE): 26.21
Mean Squared Error (MSE): 1277.34
R-squared: 0.70

Ridge Regression Performance:
Mean Absolute Error (MAE): 26.50
Mean Squared Error (MSE): 1300.28
R-squared: 0.69

Lasso Regression Performance:
Mean Absolute Error (MAE): 26.24
Mean Squared Error (MSE): 1280.17
R-squared: 0.70

Random Forest Regressor Performance:
Mean Absolute Error (MAE): 30.62
Mean Squared Error (MSE): 1478.73
R-squared: 0.65
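
To make the three reported metrics concrete, the sketch below computes MAE, MSE, and R-squared by hand with NumPy and compares them with the scikit-learn functions used in `evaluate_model`. The `y_true` and `y_pred` arrays are hypothetical values, not predictions from the notebook:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted usage values in minutes; not taken from the notebook.
y_true = np.array([10.0, 40.0, 70.0, 100.0])
y_pred = np.array([20.0, 30.0, 80.0, 90.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average absolute error, in minutes
mse = np.mean(errors ** 2)      # average squared error, in minutes squared
rmse = np.sqrt(mse)             # back in minutes, comparable to MAE

# R-squared: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")

# The hand-computed values agree with scikit-learn's implementations.
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

Because MSE is in squared units, its square root is easier to interpret: the linear regression MSE of 1277.34 reported above corresponds to an RMSE of roughly 35.7 minutes.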


Insights into model performance:

    Linear Regression and Lasso Regression both achieve the highest R-squared of 0.70, meaning they explain about 70% of the variance in usage on the held-out test set. Lasso's MAE (26.24) and MSE (1280.17) are nearly identical to those of plain linear regression (26.21 and 1277.34); its main potential benefit, feature selection, would matter only if more features were included.

    Ridge Regression scores marginally lower, with an R-squared of 0.69. With a 20% test split of 54 rows (11 test samples), a gap this small is too minor to be meaningful, and Ridge might be more robust to multicollinearity in larger datasets.

    Random Forest Regressor, even after hyperparameter tuning, lags behind the linear models with an R-squared of 0.65, which could indicate:
        The relationships in the data are simple enough to be captured well by linear methods.
        The dataset is small (54 rows), limiting the effectiveness of a complex model like Random Forest.
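
With so few rows, a single train-test split can give unstable scores. One possible alternative, sketched below, is k-fold cross-validation with the scaler wrapped in a pipeline so that it is refit on the training part of each fold. The data here is synthetic (generated with a fixed seed to roughly resemble two correlated features and a usage target), and the choice of 5 folds is an assumption, not something taken from the notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset: 54 rows, two correlated features, noisy linear target.
rng = np.random.default_rng(42)
n_rows = 54
notifications = rng.integers(8, 400, size=n_rows).astype(float)
times_opened = 0.4 * notifications + rng.normal(0, 15, size=n_rows)
X = np.column_stack([notifications, times_opened])
y = 0.3 * notifications + 0.5 * times_opened + rng.normal(0, 20, size=n_rows)

# Shuffled 5-fold split; every row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Each pipeline fits StandardScaler on the training folds only, then the model.
models = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge Regression": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R2 = {scores.mean():.2f} (std {scores.std():.2f})")
```

The spread of the fold scores gives a rough sense of how much an R-squared difference of 0.01 between models can be trusted.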

In [None]:
# `best_rf` is the tuned Random Forest model; `X` supplies the feature names
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': best_rf.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Plotting Feature Importance
fig = px.bar(feature_importance,
             x='Feature',
             y='Importance',
             title='Feature Importance from Random Forest',
             labels={'Importance': 'Importance Score', 'Feature': 'Feature'},
             text='Importance')

fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.update_layout(xaxis_title='Feature', yaxis_title='Importance Score')
fig.show()
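
The scores in `feature_importances_` are impurity-based and computed on the training data. As a cross-check, permutation importance measures how much the test score drops when one feature's values are shuffled. The sketch below is an illustration on synthetic data (the feature names are reused only as labels, and the coefficients 3.0 and 0.5 are invented), not a rerun of the notebook's model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the first feature drives the target much more than the second.
rng = np.random.default_rng(0)
n_rows = 200
X = rng.normal(size=(n_rows, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_rows)
feature_names = ["Times opened", "Notifications"]  # labels only, for illustration

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature 20 times on the test set and record the mean drop in R-squared.
result = permutation_importance(
    model, X_test, y_test, n_repeats=20, random_state=0
)

for name, impurity, perm in zip(
    feature_names, model.feature_importances_, result.importances_mean
):
    print(f"{name}: impurity-based = {impurity:.2f}, permutation = {perm:.2f}")
```

If both measures rank the features the same way, the ranking is less likely to be an artifact of how impurity-based importance is computed.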

Insights From The Graph:

    "Times Opened" is the Most Important Feature: The bar for "Times Opened" is clearly taller than the bar for "Notifications," indicating that the Random Forest relies more on the number of times the app is opened than on the number of notifications received when predicting usage time.

    Relative Importance:
        "Times Opened" has an importance score of approximately 0.64, so it accounts for nearly two-thirds of the model's total feature importance.
        "Notifications" has an importance score of around 0.36, indicating a moderate level of influence.