# Checkpoint Objective
You have been tasked with gathering COVID-19 data from an API of your choice that contains both noise and valuable data. You will then clean and pre-process the data, perform exploratory data analysis (EDA) to gain insights, and select the best-suited supervised algorithm to predict the future number of cases. Finally, you will deploy the model using Streamlit.

Hints:
1. Use the requests library to fetch data from the API.
2. Use the Pandas library to store the data in a DataFrame and clean it.
3. Use the Plotly library to create visualizations.
4. Use the Scikit-learn library to build and optimize the model.
5. Use the Streamlit library to create the user interface.
 

In [1]:
# Import libraries
import requests
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score  # regression metrics (accuracy_score applies to classification)
import joblib

#### 1. Choose a COVID-19 API that contains both valuable data and noise.

In [2]:
api_url = "https://api.covidtracking.com/v1/us/daily.json"

#### 2. Use Python to gather the data from the API and store it in a Pandas DataFrame.

In [3]:
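response = requests.get(api_url, timeout=30)  # avoid hanging forever on a slow connection
response.raise_for_status()  # stop early if the API returned an HTTP error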
data = response.json()
dt = pd.DataFrame(data)
# Display all the columns
pd.set_option('display.max_columns', None)
dt.head()

Unnamed: 0,date,states,positive,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,onVentilatorCumulative,dateChecked,death,hospitalized,totalTestResults,lastModified,recovered,total,posNeg,deathIncrease,hospitalizedIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease,hash
0,20210307,56,28756489.0,74582825.0,11808.0,40199.0,776361.0,8134.0,45475.0,2802.0,4281.0,2021-03-07T24:00:00Z,515151.0,776361.0,363825123,2021-03-07T24:00:00Z,,0,0,842,726,131835,41835,1170059,a80d0063822e251249fd9a44730c49cb23defd83
1,20210306,56,28714654.0,74450990.0,11783.0,41401.0,775635.0,8409.0,45453.0,2811.0,4280.0,2021-03-06T24:00:00Z,514309.0,775635.0,362655064,2021-03-06T24:00:00Z,,0,0,1680,503,143835,60015,1430992,dae5e558c24adb86686bbd58c08cce5f610b8bb0
2,20210305,56,28654639.0,74307155.0,12213.0,42541.0,775132.0,8634.0,45373.0,2889.0,4275.0,2021-03-05T24:00:00Z,512629.0,775132.0,361224072,2021-03-05T24:00:00Z,,0,0,2221,2781,271917,68787,1744417,724844c01659d0103801c57c0f72bf8cc8ab025c
3,20210304,56,28585852.0,74035238.0,12405.0,44172.0,772351.0,8970.0,45293.0,2973.0,4267.0,2021-03-04T24:00:00Z,510408.0,772351.0,359479655,2021-03-04T24:00:00Z,,0,0,1743,1530,177957,65487,1590984,5c549ad30f9abf48dc5de36d20fa707014be1ff3
4,20210303,56,28520365.0,73857281.0,11778.0,45462.0,770821.0,9359.0,45214.0,3094.0,4260.0,2021-03-03T24:00:00Z,508665.0,770821.0,357888671,2021-03-03T24:00:00Z,,0,0,2449,2172,267001,66836,1406795,fef6c425d2b773a9221fe353f13852f3e4a4bfb0
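
Before dropping columns, it can help to quantify the noise. The sketch below measures the share of missing values per column on a hand-built two-row sample (values copied from the output above, with only a few columns kept). The helper name `missing_share` is illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Two rows copied from the output above, restricted to a few columns.
sample = pd.DataFrame(
    {
        "date": [20210307, 20210306],
        "positive": [28756489.0, 28714654.0],
        "death": [515151.0, 514309.0],
        "recovered": [np.nan, np.nan],  # empty in the API output shown above
    }
)


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


print(missing_share(sample))
```

In the notebook, the same call on `dt` would show which columns are mostly empty and are therefore candidates for removal.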


#### 3. Clean the data by removing any irrelevant columns, null values, or duplicates.

In [4]:
# Keep only the relevant columns; .copy() guards against a SettingWithCopyWarning when columns are added later
dt = dt[['date', 'positive', 'negative', 'death', 'hospitalized']].copy()

# Handle missing values and duplicates (if any)
dt = dt.dropna()
dt = dt.drop_duplicates()
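
Because `dropna()` discards every row that has a missing value in any of the kept columns, it is worth checking how many rows each step removes. A minimal sketch on made-up data (the values and the `clean` helper are illustrative assumptions, not the notebook's code):

```python
import numpy as np
import pandas as pd


def clean(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Keep the given columns, then drop incomplete and duplicated rows."""
    kept = frame[columns].copy()
    before = len(kept)
    kept = kept.dropna()
    after_na = len(kept)
    kept = kept.drop_duplicates()
    print(
        f"dropna removed {before - after_na} rows, "
        f"drop_duplicates removed {after_na - len(kept)} rows"
    )
    return kept


# Made-up rows: one with a missing value, one exact duplicate.
raw = pd.DataFrame(
    {
        "date": [20200301, 20200302, 20200302, 20200303],
        "positive": [10.0, 20.0, 20.0, 30.0],
        "death": [np.nan, 1.0, 1.0, 2.0],
        "hash": ["a", "b", "c", "d"],  # dropped as irrelevant
    }
)
cleaned = clean(raw, ["date", "positive", "death"])
```

Note that the duplicate only becomes visible after the `hash` column is dropped, which is why the column selection comes first.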

#### 4. Pre-process the data by normalizing and scaling the numerical data.

In [5]:
# Convert the 'date' column to a datetime format and extract year, month, and day
dt['date'] = pd.to_datetime(dt['date'], format='%Y%m%d')
dt['year'] = dt['date'].dt.year
dt['month'] = dt['date'].dt.month
dt['day'] = dt['date'].dt.day

In [6]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 369 entries, 0 to 368
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          369 non-null    datetime64[ns]
 1   positive      369 non-null    float64       
 2   negative      369 non-null    float64       
 3   death         369 non-null    float64       
 4   hospitalized  369 non-null    float64       
 5   year          369 non-null    int64         
 6   month         369 non-null    int64         
 7   day           369 non-null    int64         
dtypes: datetime64[ns](1), float64(4), int64(3)
memory usage: 25.9 KB
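
The cells above derive calendar features but do not scale anything. The scaling that this step's title asks for could look like the following sketch. It uses scikit-learn's `StandardScaler` on made-up numbers and fits the scaler on the training rows only, so that statistics from the test rows do not leak into preprocessing. Distance-based models such as KNN are sensitive to feature scale, which is why this step matters here. The column values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up feature values on very different scales.
features = pd.DataFrame(
    {
        "positive": [1_000.0, 2_000.0, 3_000.0, 4_000.0],
        "negative": [50_000.0, 60_000.0, 70_000.0, 80_000.0],
        "month": [3.0, 4.0, 5.0, 6.0],
    }
)
train, test = features.iloc[:3], features.iloc[3:]

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from training rows
test_scaled = scaler.transform(test)        # reuse the same statistics

print(train_scaled.mean(axis=0))  # approximately 0 for every column
print(train_scaled.std(axis=0))   # approximately 1 for every column
```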


#### 5. Perform EDA to identify trends, correlations, and patterns in the data. Use visualizations such as histograms, scatter plots, and heatmaps to help you understand the data better.

In [7]:
# Sort the DataFrame by date
df = dt.sort_values('date')

# Create the line plot for the Evolution of Deaths Over Time
fig = px.line(df, x='date', y='death', title='Evolution of Deaths Over Time')
fig.update_xaxes(title='Date')
fig.update_yaxes(title='Deaths')
fig.show()

# Create the line plot for the Evolution of Positive Cases Over Time
fig = px.line(df, x='date', y='positive', title='Evolution of Positive Cases Over Time')
fig.update_xaxes(title='Date')
fig.update_yaxes(title='Positive Cases')
fig.show()

# Create the line plot for the Evolution of Negative Cases Over Time
fig = px.line(df, x='date', y='negative', title='Evolution of Negative Cases Over Time')
fig.update_xaxes(title='Date')
fig.update_yaxes(title='Negative Cases')
fig.show()

# Create the line plot for the Evolution of Hospitalized Cases Over Time
fig = px.line(df, x='date', y='hospitalized', title='Evolution of Hospitalized Cases Over Time')
fig.update_xaxes(title='Date')
fig.update_yaxes(title='Hospitalized Cases')
fig.show()

# Scatter Plot: Visualize the relationship between positive cases and death cases
fig_scatter = px.scatter(dt, x='positive', y='death', title='Scatter Plot: Positive Cases vs. Death Cases')
fig_scatter.update_layout(xaxis_title='Positive Cases', yaxis_title='Death Cases')
fig_scatter.show()

# Scatter Plot: Visualize the relationship between negative cases and death cases
fig_scatter = px.scatter(dt, x='negative', y='death', title='Scatter Plot: Negative Cases vs. Death Cases')
fig_scatter.update_layout(xaxis_title='Negative Cases', yaxis_title='Death Cases')
fig_scatter.show()

# Heatmap: Visualize the correlation between variables
fig_heatmap = px.imshow(dt.corr(numeric_only=False), color_continuous_scale='Viridis', zmin=-1, zmax=1, title='Correlation Heatmap')
fig_heatmap.update_layout(xaxis_title='Variables', yaxis_title='Variables')
fig_heatmap.show()
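
The plotted columns appear to be running totals: in the sample output, `positive` grows by 41,835 between 2021-03-06 and 2021-03-07, which matches `positiveIncrease`. Cumulative series rise together and therefore correlate strongly almost by construction. Differencing them gives daily counts, which may reveal more structure. A sketch on three rows taken from the output above (the `new_` column names are assumptions):

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "date": pd.to_datetime(["20210307", "20210306", "20210305"], format="%Y%m%d"),
        "positive": [28756489.0, 28714654.0, 28654639.0],
        "death": [515151.0, 514309.0, 512629.0],
    }
)

# Sort chronologically first: diff() subtracts the previous row.
daily = sample.sort_values("date").reset_index(drop=True)
daily["new_positive"] = daily["positive"].diff()
daily["new_death"] = daily["death"].diff()

print(daily[["date", "new_positive", "new_death"]])
```

The first row has no predecessor, so its differences are `NaN` and would need to be dropped before plotting or modelling.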

#### 6. Choose the best-suited supervised algorithm to predict the future number of cases. Use techniques such as train-test split, cross-validation, and grid search to optimize the model's performance.

In [8]:
# Prepare the data
# Drop the original 'date' column and the target variable 'death'
X = dt.drop(['date', 'death'], axis=1)
y = dt['death']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the KNN Regressor model
model = KNeighborsRegressor()

# Define the hyperparameter grid for grid search
param_grid = {
    'n_neighbors': [3, 5, 7, 9], 
    'weights': ['uniform', 'distance']
}

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit the GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# GridSearchCV refits the best estimator on the full training set by default (refit=True),
# so best_model is already trained and needs no further fit call
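
The cell stops after fitting, and `mean_squared_error` and `r2_score` are imported but never called. Below is a hedged sketch of the missing evaluation step, run on synthetic data so that it is self-contained. The data-generating formula is made up; in the notebook you would score `best_model` on `X_test` and `y_test` instead.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the notebook's features and target.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}
grid_search = GridSearchCV(
    KNeighborsRegressor(), param_grid, cv=5, scoring="neg_mean_squared_error"
)
grid_search.fit(X_train, y_train)

# Score the refit best estimator on rows it has never seen.
y_pred = grid_search.best_estimator_.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Best parameters:", grid_search.best_params_)
print(f"Test MSE: {mse:.3f}, RMSE: {np.sqrt(mse):.3f}, R^2: {r2:.3f}")
```

One caveat for the real data: a random split mixes past and future days, so a model can score well by interpolating between neighbouring dates. For forecasting, a chronological split (for example with scikit-learn's `TimeSeriesSplit`) is likely to give a more realistic estimate.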

#### 7. Once you have chosen the best-suited model, deploy it using Streamlit. Create a user-friendly interface that allows users to input data and view the model's predictions.

In [9]:
# Save the trained model with joblib (the file is a joblib pickle, not HDF5, despite the .h5 extension)
joblib.dump(best_model, 'best_model.h5')

['best_model.h5']
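
To confirm that a saved file can be used by the app, one can reload it and compare predictions. A self-contained sketch with a toy model and a temporary directory (the toy data and the file name `toy_model.joblib` are assumptions):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy model standing in for best_model.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()
model = KNeighborsRegressor(n_neighbors=3).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    joblib.dump(model, path)      # joblib writes a pickle-based file
    restored = joblib.load(path)  # same class, same fitted state

same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

It is generally advisable to load the file with the same scikit-learn version that was used to save it, since pickled estimators are not guaranteed to work across versions.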

The remaining step is performed in a Python file, available at the following link:

https://github.com/ahyaiche/DS-GMC-Checkpoints/blob/main/API_Streamlit/29.COVID-19_Data_Analysis_and_Prediction_Checkpoint_AmaniYch.py
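
The linked script is not reproduced here. As a rough idea of what such an app could look like, the sketch below loads `best_model.h5` and predicts from user input. Everything in it is an assumption rather than the author's code: the widget labels, the `build_features` helper, and the feature order, which simply follows the columns of `X` in step 6 (`positive`, `negative`, `hospitalized`, `year`, `month`, `day`).

```python
import datetime

import pandas as pd

# Column order assumed to match X in step 6 of the notebook.
FEATURES = ["positive", "negative", "hospitalized", "year", "month", "day"]


def build_features(day: datetime.date, positive: float, negative: float,
                   hospitalized: float) -> pd.DataFrame:
    """Build a one-row frame with the same columns the model was trained on."""
    row = {
        "positive": positive,
        "negative": negative,
        "hospitalized": hospitalized,
        "year": day.year,
        "month": day.month,
        "day": day.day,
    }
    return pd.DataFrame([row], columns=FEATURES)


def main() -> None:
    # Imported here so build_features can be reused without Streamlit installed.
    import joblib
    import streamlit as st

    st.title("COVID-19 death count prediction")
    day = st.date_input("Date", datetime.date(2021, 3, 7))
    positive = st.number_input("Cumulative positive cases", min_value=0.0, value=0.0)
    negative = st.number_input("Cumulative negative tests", min_value=0.0, value=0.0)
    hospitalized = st.number_input("Cumulative hospitalized", min_value=0.0, value=0.0)

    if st.button("Predict"):
        model = joblib.load("best_model.h5")
        features = build_features(day, positive, negative, hospitalized)
        prediction = model.predict(features)
        st.write(f"Predicted cumulative deaths: {prediction[0]:,.0f}")


# Streamlit runs the script from top to bottom, so the app file would end
# with a call to main(); it is left out here to keep the sketch importable.
```

Saved as, say, `app.py` with the `main()` call added at the end, the app would be started with `streamlit run app.py`.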

#### 8. Deploy your Streamlit app with Streamlit share

https://ds-gmc-checkpoints-okbfsxupercysg5qhxysqv.streamlit.app/