# Introduction to AI - Final Project 

## Part A. Machine Learning Project (Classification)

**Dataset: Processed COVID-19 Data**

**Project Overview:**

In this project, you will use the COVID-19 data collected by Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports) for a simple classification task.



### Instructions and tasks:

**1.	Load Data:**

Load all the CSV files for August 2020, which contain COVID-19 data, from the address above and combine them into a unified dataset.

08-01-2020.csv

08-02-2020.csv

08-03-2020.csv

...

08-28-2020.csv

08-29-2020.csv

08-30-2020.csv

08-31-2020.csv

**Hint:** To concatenate the files vertically (stack them on top of each other), you can use the `concat` function in pandas.
If you prefer a graphical tool to Python for this step, you can import the CSV files into worksheets in software such as Microsoft Excel or Google Sheets and then copy/paste or import the data as needed.
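
A minimal sketch of the vertical concatenation the hint describes. It uses two tiny in-memory frames in place of the daily CSV files; the frame names and values are made up for illustration.

```python
import pandas as pd

# Two toy "daily reports" with the same columns (illustrative values only)
day1 = pd.DataFrame({"Combined_Key": ["A", "B"], "Confirmed": [10, 20]})
day2 = pd.DataFrame({"Combined_Key": ["A", "B"], "Confirmed": [12, 25]})

# Stack the frames on top of each other and rebuild a clean 0..n-1 index
combined = pd.concat([day1, day2], ignore_index=True)

print(combined)
```

Passing all frames to a single `pd.concat` call is usually cheaper than concatenating inside a loop, because each call copies the data accumulated so far.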


In [3]:
import requests
from tqdm import tqdm 

# dates in August 2020
dates = [
    "08-01-2020", "08-02-2020", "08-03-2020", "08-04-2020", "08-05-2020",
    "08-06-2020", "08-07-2020", "08-08-2020", "08-09-2020", "08-10-2020",
    "08-11-2020", "08-12-2020", "08-13-2020", "08-14-2020", "08-15-2020",
    "08-16-2020", "08-17-2020", "08-18-2020", "08-19-2020", "08-20-2020",
    "08-21-2020", "08-22-2020", "08-23-2020", "08-24-2020", "08-25-2020",
    "08-26-2020", "08-27-2020", "08-28-2020", "08-29-2020", "08-30-2020",
    "08-31-2020"
]

# specify the base URL
base_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"

# download CSV files
for date in tqdm(dates, desc="Downloading files", unit="file"):
    file_url = f"{base_url}{date}.csv"
    response = requests.get(file_url)
    
    # check if the request was successful
    if response.status_code == 200:
        with open(f"{date}.csv", 'wb') as file:
            file.write(response.content)
    else:
        print(f"Failed to download {date}.csv")

print("Download complete.")

Downloading files: 100%|██████████| 31/31 [00:18<00:00,  1.68file/s]

Download complete.





In [4]:
import pandas as pd

# collect each day's DataFrame, then concatenate once at the end
frames = []

for date in tqdm(dates, desc="Loading and merging", unit="file"):
    file_path = f"{date}.csv"
    try:
        frames.append(pd.read_csv(file_path))
    except FileNotFoundError:
        print(f"File {file_path} not found. Skipping.")

# stack all daily reports vertically into a single pandas DataFrame
result = pd.concat(frames, ignore_index=True)

# check what the merged data looks like in the pd dataframe
result.head()


Loading and merging: 100%|██████████| 31/31 [00:00<00:00, 76.00file/s]


Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio
0,45001.0,Abbeville,South Carolina,US,2020-08-02 04:34:47,34.223334,-82.461707,288,7,0,281,"Abbeville, South Carolina, US",1174.21617,2.430556
1,22001.0,Acadia,Louisiana,US,2020-08-02 04:34:47,30.295065,-92.414197,2331,71,0,2260,"Acadia, Louisiana, US",3756.9506,3.045903
2,51001.0,Accomack,Virginia,US,2020-08-02 04:34:47,37.767072,-75.632346,1077,15,0,1062,"Accomack, Virginia, US",3332.714445,1.392758
3,16001.0,Ada,Idaho,US,2020-08-02 04:34:47,43.452658,-116.241552,8004,62,0,7942,"Ada, Idaho, US",1662.004996,0.774613
4,19001.0,Adair,Iowa,US,2020-08-02 04:34:47,41.330756,-94.471059,20,0,0,20,"Adair, Iowa, US",279.642058,0.0


**2. Drop the Case-Fatality_Ratio column from the dataset**

Note: we remove this column so that it can be recomputed from two other features in the next step. This is done solely to practice feature engineering in this project.

In [5]:
# drop the Case-Fatality_Ratio column from the dataset

result = result.drop(columns=['Case-Fatality_Ratio'])

result.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incidence_Rate
0,45001.0,Abbeville,South Carolina,US,2020-08-02 04:34:47,34.223334,-82.461707,288,7,0,281,"Abbeville, South Carolina, US",1174.21617
1,22001.0,Acadia,Louisiana,US,2020-08-02 04:34:47,30.295065,-92.414197,2331,71,0,2260,"Acadia, Louisiana, US",3756.9506
2,51001.0,Accomack,Virginia,US,2020-08-02 04:34:47,37.767072,-75.632346,1077,15,0,1062,"Accomack, Virginia, US",3332.714445
3,16001.0,Ada,Idaho,US,2020-08-02 04:34:47,43.452658,-116.241552,8004,62,0,7942,"Ada, Idaho, US",1662.004996
4,19001.0,Adair,Iowa,US,2020-08-02 04:34:47,41.330756,-94.471059,20,0,0,20,"Adair, Iowa, US",279.642058


**3. Feature Engineering:**

Create a new feature 'CFR' (Case Fatality Rate) using the formula: (Deaths / Confirmed) * 100.


In [6]:
# feature engineering

result['CFR'] = (result['Deaths'] / result['Confirmed']) * 100

result.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,CFR
0,45001.0,Abbeville,South Carolina,US,2020-08-02 04:34:47,34.223334,-82.461707,288,7,0,281,"Abbeville, South Carolina, US",1174.21617,2.430556
1,22001.0,Acadia,Louisiana,US,2020-08-02 04:34:47,30.295065,-92.414197,2331,71,0,2260,"Acadia, Louisiana, US",3756.9506,3.045903
2,51001.0,Accomack,Virginia,US,2020-08-02 04:34:47,37.767072,-75.632346,1077,15,0,1062,"Accomack, Virginia, US",3332.714445,1.392758
3,16001.0,Ada,Idaho,US,2020-08-02 04:34:47,43.452658,-116.241552,8004,62,0,7942,"Ada, Idaho, US",1662.004996,0.774613
4,19001.0,Adair,Iowa,US,2020-08-02 04:34:47,41.330756,-94.471059,20,0,0,20,"Adair, Iowa, US",279.642058,0.0
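
Rows with `Confirmed == 0` make the formula divide by zero, which in pandas yields `NaN` (for 0 / 0) or `inf` (for a positive count divided by 0) rather than an error. The sketch below shows one way to guard against this; the helper name `safe_cfr` and the toy values are assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

def safe_cfr(deaths: pd.Series, confirmed: pd.Series) -> pd.Series:
    """Case fatality rate in percent; NaN where Confirmed is 0."""
    # Replacing 0 with NaN avoids inf (x / 0) and makes 0 / 0 explicit
    return deaths / confirmed.replace(0, np.nan) * 100

# Toy rows: a normal county, a 0 / 0 case, and a 3 / 0 case
toy = pd.DataFrame({"Deaths": [7, 0, 3], "Confirmed": [288, 0, 0]})
toy["CFR"] = safe_cfr(toy["Deaths"], toy["Confirmed"])

print(toy)
```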


**4.	Data Exploration:**

Display basic statistics and information about the dataset.

Display summary statistics for the numerical columns.

In [7]:
# display basic statistics and info about the data

print("Data information: ")
print(result.info())

Data information: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123418 entries, 0 to 123417
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   FIPS            100809 non-null  float64
 1   Admin2          100964 non-null  object 
 2   Province_State  117962 non-null  object 
 3   Country_Region  123418 non-null  object 
 4   Last_Update     123418 non-null  object 
 5   Lat             120855 non-null  float64
 6   Long_           120855 non-null  float64
 7   Confirmed       123418 non-null  int64  
 8   Deaths          123418 non-null  int64  
 9   Recovered       123418 non-null  int64  
 10  Active          123418 non-null  int64  
 11  Combined_Key    123418 non-null  object 
 12  Incidence_Rate  120855 non-null  float64
 13  CFR             121515 non-null  float64
dtypes: float64(5), int64(4), object(5)
memory usage: 13.2+ MB
None


In [8]:
# display summary statistics for numerical columns

print("Summary Statistics for Numerical Columns:")
print(result.describe())

Summary Statistics for Numerical Columns:
                FIPS            Lat          Long_      Confirmed  \
count  100809.000000  120855.000000  120855.000000  123418.000000   
mean    32370.754744      35.738223     -71.401679    5450.939960   
std     17976.483772      13.341651      54.822590   28376.383585   
min        66.000000     -71.949900    -175.198200       0.000000   
25%     19049.000000      33.188201     -96.577496      80.000000   
50%     30063.000000      37.877361     -86.803732     323.000000   
75%     47039.000000      42.132991     -77.443993    1491.000000   
max     99999.000000      71.706900     178.065000  804342.000000   

              Deaths     Recovered         Active  Incidence_Rate  \
count  123418.000000  1.234180e+05  123418.000000   120855.000000   
mean      207.275722  3.431031e+03    2364.668428     1071.370957   
std      1466.677148  3.622184e+04   11012.319484     1120.388169   
min         0.000000  0.000000e+00       0.000000        0.0



**5.	Define Target Variable:**

Define a binary target variable (e.g., "High CFR" or "Low CFR") by thresholding CFR; here the threshold is the minimum value of the CFR column.


In [9]:
# define target variable and threshold

# set the threshold as the minimum value of the 'CFR' column
threshold_cfr = result['CFR'].min()

# create a new column 'CFR_Label' with binary labels
result['CFR_Label'] = result['CFR'].apply(lambda x: 'High CFR' if x > threshold_cfr else 'Low CFR')

result['CFR_Label'].value_counts()

CFR_Label
High CFR    93887
Low CFR     29531
Name: count, dtype: int64
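
A small sketch of how this thresholding behaves on toy values (the numbers are illustrative). It uses `np.where` as a vectorized alternative to `apply`, and shows that missing CFR values end up in the "Low CFR" class because a comparison with `NaN` is always false.

```python
import numpy as np
import pandas as pd

cfr = pd.Series([2.43, 0.0, np.nan, 0.77])  # toy CFR values

# Series.min() skips NaN by default
threshold = cfr.min()

# NaN > threshold evaluates to False, so missing values become "Low CFR"
labels = np.where(cfr > threshold, "High CFR", "Low CFR")

print(threshold)
print(labels.tolist())
```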

**6.	Split Data:**

Split the data into training and test sets.


In [10]:
# split data
from sklearn.model_selection import train_test_split

# Specify the features (X) and the target variable (y)
X = result[['Confirmed', 'Deaths']]
y = result['CFR_Label']

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape - X:", X_train.shape, "y:", y_train.shape)
print("Test set shape - X:", X_test.shape, "y:", y_test.shape)


Training set shape - X: (98734, 2) y: (98734,)
Test set shape - X: (24684, 2) y: (24684,)
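
The two classes are imbalanced (93887 vs. 29531 rows), so one option worth considering is a stratified split, which keeps the class ratio the same in both sets. The sketch below uses toy data with a 3:1 ratio; the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 75 "High CFR" rows and 25 "Low CFR" rows
X_toy = pd.DataFrame({"Confirmed": range(100), "Deaths": range(100)})
y_toy = pd.Series(["High CFR"] * 75 + ["Low CFR"] * 25)

# stratify=y_toy preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```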


**7.	Handle Missing Data:**

Implement strategies to handle any remaining missing data.


In [11]:
# check for missing values and drop the affected rows
# (X and y are filtered with the same mask so their rows stay aligned)

train_mask = X_train.notna().all(axis=1) & y_train.notna()
X_train, y_train = X_train[train_mask], y_train[train_mask]

test_mask = X_test.notna().all(axis=1) & y_test.notna()
X_test, y_test = X_test[test_mask], y_test[test_mask]
    

**8.	Convert Categorical Variables (if applicable):**

If there are categorical variables (e.g., country), encode them using techniques like one-hot encoding.


In [12]:
# convert categorical variable 'CFR_Label' using one-hot encoding
result_encoded = pd.get_dummies(result, columns=['CFR_Label'], drop_first=True)

result_encoded.head()

# we are not converting other variables as they are not needed for our classification model

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,CFR,CFR_Label_Low CFR
0,45001.0,Abbeville,South Carolina,US,2020-08-02 04:34:47,34.223334,-82.461707,288,7,0,281,"Abbeville, South Carolina, US",1174.21617,2.430556,False
1,22001.0,Acadia,Louisiana,US,2020-08-02 04:34:47,30.295065,-92.414197,2331,71,0,2260,"Acadia, Louisiana, US",3756.9506,3.045903,False
2,51001.0,Accomack,Virginia,US,2020-08-02 04:34:47,37.767072,-75.632346,1077,15,0,1062,"Accomack, Virginia, US",3332.714445,1.392758,False
3,16001.0,Ada,Idaho,US,2020-08-02 04:34:47,43.452658,-116.241552,8004,62,0,7942,"Ada, Idaho, US",1662.004996,0.774613,False
4,19001.0,Adair,Iowa,US,2020-08-02 04:34:47,41.330756,-94.471059,20,0,0,20,"Adair, Iowa, US",279.642058,0.0,True
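
For a categorical input feature such as the country, one-hot encoding could look like the following sketch. The toy frame and its values are assumptions for illustration; the notebook's model does not use this column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country_Region": ["US", "Italy", "US", "Brazil"],
    "Confirmed": [288, 120, 2331, 560],
})

# One indicator column per country; drop_first=True drops the first
# category in sorted order ("Brazil") to avoid a redundant column
encoded = pd.get_dummies(toy, columns=["Country_Region"], drop_first=True)

print(encoded.columns.tolist())
```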


**9. Train a Classification Model:**

Choose a classification algorithm (e.g., Logistic Regression, Decision Tree, Random Forest).

Train the model to predict the "High CFR" or "Low CFR" label based on the selected features.


In [13]:
# train a classification model (logistic regression in our case)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score, recall_score, f1_score

X = result[['Confirmed', 'Deaths']]
y = result_encoded['CFR_Label_Low CFR']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# initialize the Logistic Regression model
model = LogisticRegression(random_state=42)

# train the model on the training set
model.fit(X_train, y_train)

# predict the labels for the test set
y_pred = model.predict(X_test)


**10. Model Evaluation:**

Evaluate the model's performance using appropriate metrics (e.g., accuracy).


In [14]:
# model evaluation

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# display the evaluation metrics
print("Accuracy:", accuracy)
print("\nConfusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)
print("\nPrecision:", precision)
print("\nRecall:", recall)
print("\nF1 Score:", f1)

Accuracy: 1.0

Confusion Matrix:
 [[18756     0]
 [    0  5928]]

Classification Report:
               precision    recall  f1-score   support

       False       1.00      1.00      1.00     18756
        True       1.00      1.00      1.00      5928

    accuracy                           1.00     24684
   macro avg       1.00      1.00      1.00     24684
weighted avg       1.00      1.00      1.00     24684


Precision: 1.0

Recall: 1.0

F1 Score: 1.0


**11.	Documentation:**

Provide a report summarizing the findings, including insights from the exploratory analysis.


You can write your short report here:


The dataset was split 80:20 into training and test sets of 98,734 and 24,684 samples, respectively.

The logistic regression model achieved perfect accuracy (1.0) on the test set.
Precision, recall, and F1-score were also 1.0 for both classes.

A perfect score like this could be due to overfitting or data leakage. Data leakage is the likelier explanation here: the label is derived from CFR = (Deaths / Confirmed) * 100, and both Deaths and Confirmed are used as input features, so the target is a direct function of the inputs. The model should be tested on an independent dataset, and with features that do not determine the label, before the result is trusted.
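
One way to probe for leakage is to check whether a simple rule on the input features reproduces the label exactly. The sketch below does this on toy rows (values are illustrative), building the label the same way as in step 5 and comparing it with the rule `Deaths > 0`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Confirmed": [288, 20, 0, 8004], "Deaths": [7, 0, 0, 62]})
toy["CFR"] = toy["Deaths"] / toy["Confirmed"] * 100  # 0 / 0 gives NaN

# Threshold at the column minimum, as in step 5
threshold = toy["CFR"].min()
label_high = (toy["CFR"] > threshold).to_numpy()

# Candidate rule that uses a single input feature
rule_high = (toy["Deaths"] > 0).to_numpy()

# If this prints True, the label carries no information beyond the features
print(np.array_equal(label_high, rule_high))
```

Running the same comparison on the full `result` DataFrame would show whether the rule also holds for the real data.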

## Part B. Machine Learning Project (Regression)

In this part, you need to use the same data that you have used in the first part (Part A) of this project. 

The aim of this part is to build a simple regression model on the COVID-19 data that has been loaded through the first step of Part A.

**1.	Define Target Variable:**

Define the target variable (e.g., Deaths) for the regression model.


In [15]:
# define the features (X) and the target variable (y)

X = result[['Confirmed', 'Recovered']]
y = result['Deaths']

**2. Split Data:**

Split the data into training and testing sets.


In [16]:
# split data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape - X:", X_train.shape, "y:", y_train.shape)
print("Test set shape - X:", X_test.shape, "y:", y_test.shape)

Training set shape - X: (98734, 2) y: (98734,)
Test set shape - X: (24684, 2) y: (24684,)


**3.	Handle Missing Data:**

Implement strategies to handle any remaining missing data.


In [17]:
# handle missing data

# compute the fill values from the training set only, so that no
# information from the test set leaks into preprocessing
train_means = X_train.mean()

# fill missing values in the training set with the training means
if X_train.isnull().any().any():
    X_train = X_train.fillna(train_means)

# fill missing values in the test set with the same training means
if X_test.isnull().any().any():
    X_test = X_test.fillna(train_means)
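
An alternative sketch of mean imputation using scikit-learn's `SimpleImputer`. The imputer learns the column means from the training data and reuses them for the test data; the toy frames and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_tr = pd.DataFrame({"Confirmed": [10.0, np.nan, 30.0], "Recovered": [1.0, 2.0, 3.0]})
X_te = pd.DataFrame({"Confirmed": [np.nan, 50.0], "Recovered": [4.0, np.nan]})

imputer = SimpleImputer(strategy="mean")
X_tr_filled = imputer.fit_transform(X_tr)  # learns the training means
X_te_filled = imputer.transform(X_te)      # reuses them for the test set

print(X_te_filled)
```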


**4.	Convert Categorical Variables (if applicable):**

If there are categorical variables (e.g., country), encode them using techniques like one-hot encoding.


In [18]:
# convert categorical variables

# no need for that here

**5. Train a regression Model:**

Choose a regression algorithm (e.g., Linear Regression, Random Forest Regression).

Train the model to predict the total deaths based on the selected features.


In [19]:
# train a regression model (linear regression in our case)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, median_absolute_error

# initialize the Linear Regression model
regression_model = LinearRegression()


# train the model on the training set
regression_model.fit(X_train, y_train)

# predict the target variable for the test set
y_pred = regression_model.predict(X_test)

**6. Model Evaluation:**

Evaluate the model's performance using appropriate regression metrics (e.g., Mean Absolute Error, R-squared).


In [21]:
# model evaluation

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mde = median_absolute_error(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)
print("Mean Absolute Error:", mae)
print("Median Absolute Error:", mde)


Mean Squared Error: 1115243.347313111
R-squared: 0.5216224793116986
Mean Absolute Error: 150.53034717574616
Median Absolute Error: 8.12417870408489
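
To make these metrics concrete, the sketch below computes the mean absolute error, median absolute error, and R² by hand with NumPy on toy values (the numbers are illustrative and include one deliberate outlier). It shows how a single large error pulls the mean up while leaving the median almost unchanged.

```python
import numpy as np

y_true = np.array([0.0, 10.0, 20.0, 1000.0])  # toy targets with one outlier
y_hat = np.array([2.0, 8.0, 25.0, 600.0])     # toy predictions

errors = np.abs(y_true - y_hat)               # absolute error per sample

mae = errors.mean()                           # sensitive to the outlier
medae = np.median(errors)                     # robust to the outlier

# R-squared: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, medae, r2)
```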


**7. Documentation:**

Provide a report summarizing the findings, including insights from the exploratory analysis.


You can write your short report here:

The MSE is relatively high, which indicates that the model fails to capture a considerable amount of the variability in the data.
The R² value of 0.5216 means the model explains about 52% of the variance in the test set, a moderate fit that leaves room for improvement.
The mean absolute error shows that predictions are off by about 150 deaths on average, which is substantial.
The median absolute error is much smaller (about 8) because the median is less sensitive to extreme values; this suggests that a small number of large errors dominate the mean.

Unlike the logistic regression classifier in Part A, which achieved a perfect fit, the linear regression model fits the data only moderately well.

As next steps, other algorithms could be compared against Linear Regression, and the outliers and anomalies that cause the large variability could be investigated.

-----------------------------------------------------------------------------------------------------