# Checkpoint Objective
You are tasked with gathering COVID-19 data from an API of your choice that contains both noise and valuable data. You will then clean and pre-process the data, perform exploratory data analysis (EDA) to gain insights, and select the best-suited supervised algorithm to predict the future number of cases. Finally, you will deploy the model using Streamlit.

Hints:
1. Use the requests library to fetch data from the API.
2. Use the Pandas library to store the data in a DataFrame and clean it.
3. Use the Plotly library to create visualizations.
4. Use the Scikit-learn library to build and optimize the model.
5. Use the Streamlit library to create the user interface.
 

In [1]:
# Import libraries
import requests
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import plotly.express as px
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import joblib

#### 1. Choose a COVID-19 API of your choice that contains both valuable data and noise.

In [2]:
api_url = "https://api.covidtracking.com/v1/us/daily.json"

#### 2. Use Python to gather the data from the API and store it in a Pandas DataFrame.

In [3]:
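# Fetch the daily US data; fail fast on network stalls or HTTP errors
response = requests.get(api_url, timeout=30)
response.raise_for_status()
data = response.json()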
dt = pd.DataFrame(data)
pd.set_option('display.max_columns', None)
dt.head()

Unnamed: 0,date,states,positive,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,onVentilatorCumulative,dateChecked,death,hospitalized,totalTestResults,lastModified,recovered,total,posNeg,deathIncrease,hospitalizedIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease,hash
0,20210307,56,28756489.0,74582825.0,11808.0,40199.0,776361.0,8134.0,45475.0,2802.0,4281.0,2021-03-07T24:00:00Z,515151.0,776361.0,363825123,2021-03-07T24:00:00Z,,0,0,842,726,131835,41835,1170059,a80d0063822e251249fd9a44730c49cb23defd83
1,20210306,56,28714654.0,74450990.0,11783.0,41401.0,775635.0,8409.0,45453.0,2811.0,4280.0,2021-03-06T24:00:00Z,514309.0,775635.0,362655064,2021-03-06T24:00:00Z,,0,0,1680,503,143835,60015,1430992,dae5e558c24adb86686bbd58c08cce5f610b8bb0
2,20210305,56,28654639.0,74307155.0,12213.0,42541.0,775132.0,8634.0,45373.0,2889.0,4275.0,2021-03-05T24:00:00Z,512629.0,775132.0,361224072,2021-03-05T24:00:00Z,,0,0,2221,2781,271917,68787,1744417,724844c01659d0103801c57c0f72bf8cc8ab025c
3,20210304,56,28585852.0,74035238.0,12405.0,44172.0,772351.0,8970.0,45293.0,2973.0,4267.0,2021-03-04T24:00:00Z,510408.0,772351.0,359479655,2021-03-04T24:00:00Z,,0,0,1743,1530,177957,65487,1590984,5c549ad30f9abf48dc5de36d20fa707014be1ff3
4,20210303,56,28520365.0,73857281.0,11778.0,45462.0,770821.0,9359.0,45214.0,3094.0,4260.0,2021-03-03T24:00:00Z,508665.0,770821.0,357888671,2021-03-03T24:00:00Z,,0,0,2449,2172,267001,66836,1406795,fef6c425d2b773a9221fe353f13852f3e4a4bfb0


#### 3. Clean the data by removing any irrelevant columns, null values, or duplicates.

In [4]:
# Remove irrelevant columns
dt = dt[['date', 'positive', 'negative', 'hospitalizedCurrently', 'death']]

# Handle missing values and duplicates (if any)
dt = dt.dropna()
dt = dt.drop_duplicates()
dt

Unnamed: 0,date,positive,negative,hospitalizedCurrently,death
0,20210307,28756489.0,74582825.0,40199.0,515151.0
1,20210306,28714654.0,74450990.0,41401.0,514309.0
2,20210305,28654639.0,74307155.0,42541.0,512629.0
3,20210304,28585852.0,74035238.0,44172.0,510408.0
4,20210303,28520365.0,73857281.0,45462.0,508665.0
...,...,...,...,...,...
351,20200321,30580.0,67803.0,1492.0,335.0
352,20200320,23640.0,51743.0,1042.0,273.0
353,20200319,17540.0,39938.0,617.0,203.0
354,20200318,12934.0,31577.0,416.0,152.0
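
Before dropping rows, it can help to count the missing values per column and to turn the integer `date` (for example `20210307`) into a real datetime sorted oldest first. The following is a minimal sketch on a small hand-made frame; the `tidy` helper name and the sample rows are assumptions for illustration, not part of the notebook above.

```python
import numpy as np
import pandas as pd


def tidy(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete and duplicate rows, parse dates, sort oldest first."""
    out = frame.dropna().drop_duplicates().copy()
    # The API encodes dates as integers such as 20210307
    out["date"] = pd.to_datetime(out["date"].astype(str), format="%Y%m%d")
    return out.sort_values("date").reset_index(drop=True)


# Hypothetical sample with one missing value and one duplicated row
sample = pd.DataFrame({
    "date": [20210307, 20210306, 20210306, 20210305],
    "positive": [28756489.0, 28714654.0, 28714654.0, np.nan],
    "death": [515151.0, 514309.0, 514309.0, 512629.0],
})

missing_per_column = sample.isna().sum()  # number of NaNs in each column
clean = tidy(sample)
print(missing_per_column)
print(clean)
```

Checking `isna().sum()` first shows how much data `dropna()` is about to discard, which matters here because several API columns are sparsely populated.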


#### 4. Pre-process the data by normalizing and scaling the numerical data.

In [5]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
Index: 356 entries, 0 to 355
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   356 non-null    int64  
 1   positive               356 non-null    float64
 2   negative               356 non-null    float64
 3   hospitalizedCurrently  356 non-null    float64
 4   death                  356 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 16.7 KB
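
The `dt.info()` call above only inspects the columns; it does not rescale anything. A minimal sketch of the scaling step itself, using the `StandardScaler` class already imported, could look like the following. The small `frame` reuses the last four rows shown in the cleaned output, and the names `frame` and `scaled` are assumptions for illustration; in the notebook the same call would apply to the numeric feature columns of `dt`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for the cleaned frame (values from the last rows printed above)
frame = pd.DataFrame({
    "positive": [12934.0, 17540.0, 23640.0, 30580.0],
    "negative": [31577.0, 39938.0, 51743.0, 67803.0],
    "hospitalizedCurrently": [416.0, 617.0, 1042.0, 1492.0],
})

scaler = StandardScaler()
# fit_transform returns a NumPy array; wrap it to keep the column labels
scaled = pd.DataFrame(
    scaler.fit_transform(frame), columns=frame.columns, index=frame.index
)
print(scaled.round(3))
```

Each scaled column has mean 0 and unit (population) standard deviation. When a train-test split is used, fitting the scaler on the training rows only and then calling `transform` on the test rows avoids leaking test statistics into training.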


#### 5. Perform EDA to identify trends, correlations, and patterns in the data. Use visualizations such as histograms, scatter plots, and heatmaps to help you understand the data better.

In [6]:
# Histogram: Visualize the distribution of positive cases
fig_histogram = px.histogram(dt, x='positive', nbins=20, histnorm='probability', title='Distribution of Positive COVID-19 Cases')
fig_histogram.update_layout(xaxis_title='Positive Cases', yaxis_title='Probability')
fig_histogram.show()

# Scatter Plot: Visualize the relationship between positive cases and death cases
fig_scatter = px.scatter(dt, x='positive', y='death', title='Scatter Plot: Positive Cases vs. Death Cases')
fig_scatter.update_layout(xaxis_title='Positive Cases', yaxis_title='Death Cases')
fig_scatter.show()

# Scatter Plot: Visualize the relationship between negative cases and death cases
fig_scatter = px.scatter(dt, x='negative', y='death', title='Scatter Plot: Negative Cases vs. Death Cases')
fig_scatter.update_layout(xaxis_title='Negative Cases', yaxis_title='Death Cases')
fig_scatter.show()

# Heatmap: Visualize the correlation between variables
fig_heatmap = px.imshow(dt.corr(), color_continuous_scale='Viridis', zmin=-1, zmax=1, title='Correlation Heatmap')
fig_heatmap.update_layout(xaxis_title='Variables', yaxis_title='Variables')
fig_heatmap.show()
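
The heatmap is easier to interpret alongside the numbers behind it. As a small sketch, the correlations with `death` can be ranked directly; the toy `frame` below is an assumption for illustration and stands in for `dt`.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be `dt`
frame = pd.DataFrame({
    "positive": [100.0, 200.0, 300.0, 400.0, 500.0],
    "hospitalizedCurrently": [50.0, 40.0, 45.0, 20.0, 30.0],
    "death": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Pearson correlation of every other column with the target,
# ordered by absolute strength (strongest first)
corr_with_death = (
    frame.corr()["death"]
    .drop("death")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_death)
```

Because `positive`, `negative`, and `death` are cumulative totals that all rise over time, strong correlations between them are expected and do not by themselves indicate a causal link.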

#### 6. Choose the best-suited supervised algorithm to predict the future number of cases. Use techniques such as train-test split, cross-validation, and grid search to optimize the model's performance.

In [7]:
# Step 1: Prepare the data
X = dt.drop(['date', 'death'], axis = 1)
y = dt['death']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Initialize the Linear Regression model
model = LinearRegression()

# Step 3: Define the hyperparameter grid for grid search
param_grid = {
    'fit_intercept': [True, False],
}

# Step 4: Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Step 5: Fit the GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Step 6: Get the best model from the grid search
best_model = grid_search.best_estimator_

# Step 7: Perform cross-validation on the best model (Optional: You can also skip this step and directly proceed to Step 9)
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')

# Step 8: Calculate mean and standard deviation of the cross-validation scores (Optional)
cv_mean_score = np.mean(cv_scores)
cv_std_score = np.std(cv_scores)

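# Step 9: Refit the best model on the full training set (GridSearchCV already does this with its default refit=True, so this step is redundant but harmless)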
best_model.fit(X_train, y_train)

# Step 10: Make predictions using the best model on the test set
y_pred = best_model.predict(X_test)

# Step 11: Evaluate the best model's performance on the test set
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)

print("Best Model Parameters:", grid_search.best_params_)
print("Mean Squared Error (on test set):", mse)
print("R-squared (on test set):", r_squared)
print("Cross-Validation Mean Squared Error:", -cv_mean_score)
print("Cross-Validation Standard Deviation:", cv_std_score)

Best Model Parameters: {'fit_intercept': True}
Mean Squared Error (on test set): 426272257.71851134
R-squared (on test set): 0.9802203783320138
Cross-Validation Mean Squared Error: 313531630.81324774
Cross-Validation Standard Deviation: 111522789.41586937
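
The split above shuffles rows at random, so days adjacent to each test day end up in the training set. Since the rows form a daily time series of cumulative totals, an order-preserving evaluation is an alternative worth comparing. The sketch below uses scikit-learn's `TimeSeriesSplit` on synthetic data; the generated series and all variable names are assumptions, and the scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(42)

# Synthetic cumulative series, ordered oldest to newest (illustrative only)
n_days = 120
positive = np.cumsum(rng.integers(50, 150, size=n_days)).astype(float)
death = 0.02 * positive + rng.normal(0.0, 5.0, size=n_days)

X = positive.reshape(-1, 1)
y = death

# Each fold trains on earlier days and validates on the days that follow
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(
    LinearRegression(), X, y, cv=tscv, scoring="neg_mean_squared_error"
)
rmse_per_fold = np.sqrt(-scores)
print("RMSE per fold:", rmse_per_fold.round(2))
```

To apply this to `dt`, the rows would first need sorting by `date` in ascending order, because the API returns the newest day first. Taking the square root of the MSE also puts the error back in the units of the target: the test MSE reported above corresponds to an RMSE of roughly 20,600 deaths.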


#### 7. Once you have chosen the best-suited model, deploy it using Streamlit. Create a user-friendly interface that allows users to input data and view the model's predictions.

In [8]:
# Save the trained model with joblib (despite the .h5 extension, joblib writes a pickle-based file, not HDF5)
joblib.dump(best_model, 'best_model.h5')

['best_model.h5']

#### 8. Deploy your Streamlit app with Streamlit Share

The remaining steps will be performed in a Python file.
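
A minimal sketch of what such a file might contain is shown below. It assumes the file is saved as `app.py` next to `best_model.h5`, that Streamlit is installed, and that the model expects the three feature columns used in training. The helper names `predict_deaths` and `run_app`, the page title, and the default input values are all hypothetical.

```python
import joblib
import pandas as pd

MODEL_PATH = "best_model.h5"  # file written by joblib.dump in the notebook
FEATURES = ["positive", "negative", "hospitalizedCurrently"]


def predict_deaths(model, positive, negative, hospitalized_currently):
    """Return the model's predicted cumulative deaths for one set of inputs."""
    row = pd.DataFrame(
        [[positive, negative, hospitalized_currently]], columns=FEATURES
    )
    return float(model.predict(row)[0])


def run_app():
    """Build the Streamlit page; call this from the script Streamlit runs."""
    import streamlit as st  # imported here so predict_deaths works without it

    st.title("COVID-19 Death Count Prediction")
    model = joblib.load(MODEL_PATH)

    positive = st.number_input(
        "Cumulative positive cases", min_value=0.0, value=1_000_000.0
    )
    negative = st.number_input(
        "Cumulative negative tests", min_value=0.0, value=5_000_000.0
    )
    hospitalized = st.number_input(
        "Currently hospitalized", min_value=0.0, value=10_000.0
    )

    if st.button("Predict"):
        prediction = predict_deaths(model, positive, negative, hospitalized)
        st.write(f"Predicted cumulative deaths: {prediction:,.0f}")


# To launch the page, add a call to run_app() at the end of app.py and start it
# with: streamlit run app.py
```

Keeping the prediction logic in a plain function separate from the UI code makes it possible to check the model wiring without starting a Streamlit server.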