# CSI 4106 Introduction to Artificial Intelligence 
## Assignment 1: Data Preparation

## Report Title: Data Preparation and Feature Engineering for Machine Learning Classification

### Identification

Name: Alex Govier <br/>
Student Number: 300174954

### 1. Dataset Selection

#### Import Necessary Libraries

In [48]:
import pandas as pd
import numpy as np

#### Read Datasets

Here I read each dataset into its own variable, then collect the DataFrames in one list and their display names in a second list so that both can be used later.

In [49]:
# Reading each dataset from my GitHub and then declaring them as individual variables
car_url = "https://github.com/alex-govier5/intro-to-ai/raw/master/A1/car_dataset.csv"
car_dataset = pd.read_csv(car_url)

credit_scores_url = "https://github.com/alex-govier5/intro-to-ai/raw/master/A1/credit_scores_dataset.csv"
credit_scores_dataset = pd.read_csv(credit_scores_url, low_memory=False) # Needed low_memory option to suppress warning about mixed dtypes 

dermatology_url = "https://github.com/alex-govier5/intro-to-ai/raw/master/A1/dermatology_dataset.csv"
dermatology_dataset = pd.read_csv(dermatology_url)

glass_url = "https://github.com/alex-govier5/intro-to-ai/raw/master/A1/glass_dataset.csv"
glass_dataset = pd.read_csv(glass_url)

maternal_health_url = "https://github.com/alex-govier5/intro-to-ai/raw/master/A1/maternal_health_dataset.csv"
maternal_health_dataset = pd.read_csv(maternal_health_url)

sixteen_p_url = "https://github.com/alex-govier5/intro-to-ai/raw/master/A1/sixteen_p_dataset.csv"
sixteen_p_dataset = pd.read_csv(sixteen_p_url, encoding='ISO-8859-1') # Needed special encoding, the default of utf-8 wasn't working

wine_qt_url = "https://github.com/alex-govier5/intro-to-ai/raw/master/A1/wine_qt_dataset.csv"
wine_qt_dataset = pd.read_csv(wine_qt_url)

# Creating a list of the datasets and a separate list of arbitrary names for them
datasets = [car_dataset, credit_scores_dataset, dermatology_dataset, glass_dataset, maternal_health_dataset, sixteen_p_dataset, wine_qt_dataset]
dataset_names = ['Car Dataset', 'Credit Score Dataset', 'Dermatology Dataset', 'Glass Dataset', 'Maternal Health Dataset', '16P Dataset', 'Wine Dataset']
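
As a possible alternative to keeping two parallel lists, the names and DataFrames could be paired in a single dictionary so they cannot drift out of sync. The sketch below is only an illustration: it uses small stand-in DataFrames with made-up values rather than the real CSV files, and the names `car_df`, `glass_df`, and `named_datasets` are hypothetical.

```python
import pandas as pd

# Stand-in DataFrames (hypothetical toy data, not the real CSV contents)
car_df = pd.DataFrame({"doors": [2, 4], "safety": ["low", "high"]})
glass_df = pd.DataFrame({"RI": [1.52, 1.51], "Type": [1, 2]})

# Pair each display name with its DataFrame in one structure
named_datasets = {"Car Dataset": car_df, "Glass Dataset": glass_df}

# Iterating yields the name and the data together
shapes = {name: df.shape for name, df in named_datasets.items()}
print(shapes)
```

With this layout, a loop such as `for name, df in named_datasets.items()` replaces the `enumerate` plus index lookup pattern.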

### 2. Exploratory Analysis

#### 1. Analysis of Missing Values

Here I will examine the datasets to identify and assess missing values in various attributes.

##### 1.1 Finding datasets and attributes with missing values

Two datasets contain missing values: the Credit Score Dataset and the Dermatology Dataset. In the Credit Score Dataset, the 'Name', 'Occupation', 'Monthly_Inhand_Salary', 'Type_of_Loan', 'Num_of_Delayed_Payment', 'Num_Credit_Inquiries', 'Credit_History_Age', 'Amount_invested_monthly', and 'Monthly_Balance' attributes all have missing values. In the Dermatology Dataset, only the 'age' attribute has missing values.

##### 1.2 Describe Methodology Used For Investigation
I first create a list of generic placeholders that could stand in for missing values. Some of them, such as "_______" and "Not Specified", I added after a quick look through the datasets showed them appearing frequently. <br/>
I then loop through each dataset and replace every placeholder with NaN. Pandas detects NaN values efficiently, so I don't have to check each column for several different placeholders, and the search becomes simpler. <br/>
Next, I check whether the dataset contains any NaN values and, if so, print the name of that dataset. <br/>
The columns of that dataset are then checked, and those containing NaN values are listed so we can see which attributes have missing values. <br/>
I also display how many values are missing from each column using the sum() function, so I can make more informed decisions about imputation and about which dataset to select for later use. <br/>
Finally, I display the total number of rows in the dataset to get an idea of what percentage of the data is missing from each column. <br/>
If the dataset contains no missing values, the else branch states that instead.

In [50]:
# Create list of generic placeholders that could be used for missing values
missing_value_placeholders = ['?', '', 'NA', 'N/A', 'null', '-', '_______', 'Not Specified']

# Loop through each dataset
for i, dataset in enumerate(datasets):
    # Replace the placeholders with NaN so Pandas can recognize missing values more easily
    dataset.replace(missing_value_placeholders, np.nan, inplace=True)

    # Check if there are any missing values
    if dataset.isnull().values.any():
        # Indicate the dataset has missing values
        print(f"{dataset_names[i]} has missing values.")
        
        # Find the columns with missing values
        missing_columns = dataset.columns[dataset.isnull().any()]
        print(f"Columns with missing values in {dataset_names[i]}: {missing_columns}\n")

        # Print the number of missing values per column
        print(dataset[missing_columns].isnull().sum())

        # Print total number of rows for reference
        print(f"Total rows: {len(dataset)}")
        print("\n")
        
    else:
        # Indicate that the dataset has no missing values otherwise
        print(f"{dataset_names[i]} has no missing values.\n")

Car Dataset has no missing values.

Credit Score Dataset has missing values.
Columns with missing values in Credit Score Dataset: Index(['Name', 'Occupation', 'Monthly_Inhand_Salary', 'Type_of_Loan',
       'Num_of_Delayed_Payment', 'Num_Credit_Inquiries', 'Credit_History_Age',
       'Amount_invested_monthly', 'Monthly_Balance'],
      dtype='object')

Name                        9985
Occupation                  7062
Monthly_Inhand_Salary      15002
Type_of_Loan               12816
Num_of_Delayed_Payment      7002
Num_Credit_Inquiries        1965
Credit_History_Age          9030
Amount_invested_monthly     4479
Monthly_Balance             1200
dtype: int64
Total rows: 100000


Dermatology Dataset has missing values.
Columns with missing values in Dermatology Dataset: Index(['age'], dtype='object')

age    8
dtype: int64
Total rows: 366


Glass Dataset has no missing values.

Maternal Health Dataset has no missing values.

16P Dataset has no missing values.

Wine Dataset has no missing values.
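
The counts above can also be turned into percentages directly, rather than comparing them to the row total by eye. The following is a minimal sketch on a made-up toy DataFrame (the `toy` data and the `missing_pct` name are assumptions for illustration, not part of the assignment datasets).

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for a real dataset (values are made up)
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "occupation": ["Engineer", "Teacher", np.nan, np.nan],
    "income": [50000, 62000, 58000, 45000],
})

# isnull().mean() gives the fraction of missing entries per column
missing_pct = (toy.isnull().mean() * 100).round(1)

# Keep only the columns that actually have missing values, largest first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Applied inside the loop over the real datasets, the same two lines would report the share of missing values per column alongside the raw counts.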

##### 1.3 Data Imputation Propositions
First, I will give my suggestions for the Credit Scores dataset: <br/>
For the Name attribute, roughly 10% of the values are missing. Name is a categorical attribute that will probably be unique to most individuals, so imputing it is not good practice. Since this attribute won't be very useful for the model overall, I would suggest dropping the column. <br/>
For the Occupation attribute, roughly 7% of the values are missing. In this case I would use mode imputation to fill in the missing values with the most frequent occupation. <br/>
For the Monthly_Inhand_Salary attribute, roughly 15% of the values are missing. Since this is a continuous numeric attribute, I would use mean or median imputation. If the salary distribution turns out to be skewed, I would use the median so that outliers don't distort the imputed values. <br/>
For the Type_of_Loan attribute, roughly 13% of the values are missing. I would again use mode imputation here since it is a categorical attribute. <br/>
For the Num_of_Delayed_Payment attribute, roughly 7% of the values are missing. I would again consider mean or median imputation because it is a numeric attribute, choosing the median if the distribution is heavily skewed. <br/>
For the Num_Credit_Inquiries attribute, roughly 2% of the values are missing, so I would use mean or median imputation for the same reasons as above. <br/>
For the Credit_History_Age attribute, roughly 9% of the values are missing. I would lean towards median imputation here because the age of the credit history is more likely to be skewed, but the mean could still work well. <br/>
For the Amount_invested_monthly attribute, roughly 4% of the values are missing. I would again use mean or median imputation here. <br/>
For the Monthly_Balance attribute, roughly 1% of the values are missing, so mean or median imputation should work well here too. <br/> <br/>
Now for the Dermatology dataset: <br/>
For the age attribute, roughly 2% of the values are missing. Since age is numeric and only a very small percentage is missing, I again think median or mean imputation would work best. <br/>
These are my suggestions for imputing the missing values.
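
The three strategies proposed above (drop an identifier-like column, fill a categorical column with its mode, and fill a numeric column with its mean or median depending on skew) could be sketched as follows. This is only an illustration under stated assumptions: the `toy` DataFrame contains made-up values, and the skewness threshold of 1.0 is an arbitrary choice, not something prescribed by the assignment.

```python
import numpy as np
import pandas as pd

# Toy DataFrame (made-up values) mimicking the kinds of columns discussed above
toy = pd.DataFrame({
    "Name": ["Ann", None, "Raj", "Lee", "Mia"],
    "Occupation": ["Teacher", "Engineer", None, "Teacher", "Doctor"],
    "Monthly_Inhand_Salary": [2000.0, 2500.0, np.nan, 3000.0, 50000.0],
})

# Identifier-like column: drop rather than impute
toy = toy.drop(columns=["Name"])

# Categorical column: fill with the most frequent value (mode)
occupation_mode = toy["Occupation"].mode()[0]
toy["Occupation"] = toy["Occupation"].fillna(occupation_mode)

# Numeric column: use the median when the distribution is clearly skewed,
# otherwise the mean. The threshold of 1.0 is an arbitrary assumption.
salary = toy["Monthly_Inhand_Salary"]
fill_value = salary.median() if abs(salary.skew()) > 1.0 else salary.mean()
toy["Monthly_Inhand_Salary"] = salary.fillna(fill_value)

print(toy)
```

In this toy example the single very large salary makes the distribution heavily skewed, so the median is chosen and the outlier does not pull the imputed value upward.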

#### 2. Selecting a Classification Task
The dataset I will select for my classification task is the Credit Scores dataset. This dataset stores customer information and the associated credit scores. It contains plenty of useful, relevant information about the customers, such as annual income, number of delayed payments, credit utilization ratio, and outstanding debt, to name a few. This makes it a strong dataset to work with, and I provide further justification for selecting it below. <br/>
For the classification task, I have decided to predict a customer's credit score based on numerous attributes from the Credit Scores dataset. <br/>
I have chosen this task because it has real-world relevance: predicting credit scores is a common use case in the finance industry for decisions about loans, interest rates, credit limits, and so on. <br/>
I am also working with a very feature-rich dataset. By including multiple relevant attributes, my predictions should, in theory, end up being quite accurate. The dataset is also very large: with 100,000 total rows, it is by far the largest of all the datasets initially examined, so I will have plenty of data to work with when making predictions. Finally, I already have an imputation strategy, outlined above, for filling in missing values, so the missing values shouldn't cause too many problems for the predictions. <br/>

##### 2.1 Objective of Task
As mentioned above, the objective of this classification task is to accurately predict a customer's credit score based on the relevant attributes in the dataset. In terms of applications, such a prediction can be used to determine loan eligibility, interest rates, credit limits, and more for a given customer. I don't personally have much expertise in these application domains, but I expect to learn a lot from this classification task. In the next section I will focus on which attributes to analyze during classification so that I can make the most accurate predictions possible.

#### 3. Attribute Analysis
Here I will analyze the attributes in my dataset to see which ones should be included in my investigation.

##### 3.1 Determining Attribute Informativeness

--------------------------------------------------------------------------

### References

