PROJECT 1

Description: Consider yourself a new employee at a company who has just been given a data set. Your job is to write a useful description of the data set for your co-workers and make some basic plots.

Import relevant packages

In [None]:
# Import relevant packages
import numpy as np
import pandas as pd
from scipy.io import loadmat

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA


# Plotting style (set_theme sets both the style and the font scale in one call)
sns.set_theme(style='darkgrid', font_scale=1.0)

# 1. Description of your dataset

• Explain the overall problem of interest and the associated data.  

• Provide a reference to where you obtained the data.  

• Summarize previous analysis of the data. (i.e. go through one or two of the original source papers and read what they did to the data and summarize their results).  


• You will be asked to apply (1) classification and (2) regression on your data in the next report. For now, we want you to consider how this should be done. Therefore:  

– Explain, in the context of your problem of interest, what you hope to accomplish/learn from the data using these techniques?  

– Explain which attribute you wish to predict in the regression based on which other attributes?  

– Which class label will you predict based on which other attributes in the classification task?  

– Explain whether you need to transform individual attributes in order to carry out these tasks (e.g. centering, standardization, discretization, log transform, etc.) and how you plan to do this.  
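
The transformations listed above can be sketched as follows. This is only an illustration on made-up data: the `toy` DataFrame, its column names (`income`, `age`) and the bin edges are assumptions, not part of the project data, and which transformation is appropriate depends on your own attributes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for your own attributes
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=200),  # right-skewed, positive
    "age": rng.integers(18, 80, size=200).astype(float),
})

# Centering: subtract the column mean
centered = toy - toy.mean()

# Standardization: zero mean and unit standard deviation per column
standardized = (toy - toy.mean()) / toy.std(ddof=1)

# Log transform: compress a right-skewed, strictly positive attribute
log_income = np.log(toy["income"])

# Discretization: cut a continuous attribute into ordered bins
# (the bin edges are arbitrary choices made for this illustration)
age_group = pd.cut(toy["age"], bins=[17, 30, 50, 80],
                   labels=["young", "middle", "senior"])

print(standardized.describe().loc[["mean", "std"]].round(3))
print(age_group.value_counts())
```

Comparing `toy["income"].skew()` with `log_income.skew()` is one quick way to check whether the log transform actually reduced the skewness.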

One of these tasks is likely more relevant than the rest and will be denoted the main machine learning aim in the following.
The purpose of the following questions, which ask you to describe and visualize the data, is to allow you to reflect on the feasibility of the main machine learning aim.

Load relevant data

In [None]:
# Load the data

file_path = 'file_path.csv'  # Replace with the path to your data file
df = pd.read_csv(file_path)

Print information about the data

In [None]:
# Print information about the data
df.info()  # prints its summary directly and returns None, so no print() is needed
print(df.head())
print(df.describe())
print(df.columns)


# 2. Detailed explanation of the attributes of the data

• Describe if the attributes are discrete/continuous and whether they are nominal/ordinal/interval/ratio.  

• Give an account of whether there are data issues (i.e. missing values or corrupted data) and describe them if so and how you will handle them.  

• Include relevant summary statistics of the attributes. Reflect on the values.  

If your data set contains many similar attributes, you may restrict yourself to describing a few representative features (apply common sense). You can place additional results in the appendix if needed.
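
As a starting point for the attribute description, a compact per-attribute overview can be built with pandas. The sketch below uses a small made-up DataFrame (`toy` and its column names are assumptions). Note that the measurement level (nominal/ordinal/interval/ratio) cannot be inferred from the data type alone; the comments show how you might annotate it by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; replace with your own DataFrame
toy = pd.DataFrame({
    "height_cm": [172.5, 180.1, np.nan, 165.0, 158.3],    # continuous, ratio
    "n_children": [0, 2, 1, 3, 0],                        # discrete, ratio
    "education": ["low", "high", "medium", None, "low"],  # discrete, ordinal
    "city": ["Oslo", "Aarhus", "Lund", "Oslo", "Lund"],   # discrete, nominal
})

# One row per attribute: storage type, number of distinct values,
# number of missing values, and one example value
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),      # missing values are not counted
    "n_missing": toy.isna().sum(),
    "example": toy.apply(lambda col: col.dropna().iloc[0]),
})
print(overview)
```

A low `n_unique` relative to the number of rows often hints at a discrete attribute, but the final classification is a judgment you make from the meaning of the attribute.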

Find data issues and clean the data

In [None]:
# Find data issues (i.e. missing values or corrupted data)
print(df.isnull().sum())

# Drop rows with missing values
df = df.dropna().copy()
# Alternatively, fill missing numeric values with the column mean or median
# df = df.fillna(df.mean(numeric_only=True))
# df = df.fillna(df.median(numeric_only=True))

# Encode categorical variables, if any, as integer codes.
# Note: LabelEncoder assigns codes in sorted order of the labels, which imposes
# an arbitrary ordering; consider OneHotEncoder for nominal attributes.
label_encoders = {}
for column in df.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    label_encoders[column] = le  # keep the encoder so the codes can be inverted later

# Split data into attribute matrix X and target vector y.
# Replace 'target_column' with the actual name of your target column.
X = df.drop('target_column', axis=1)
y = df['target_column']

Make summary statistics on attributes

In [None]:
# Include summary statistics of the attributes
print(X.describe())


# 3. Data visualization(s) based on suitable visualization techniques

Touch upon the following aspects, using visualizations where sensible. Keep in mind the ACCENT principles and Tufte’s guidelines when you visualize the data.


• Are there issues with extreme values or outliers in the data?  

• How are the individual attributes distributed (e.g. normally distributed)?  

• Are the attributes correlated?  
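
The three questions above can be explored with box plots, histograms and a correlation matrix. The following is a minimal sketch on made-up data: the `toy` DataFrame, its columns and the injected extreme value are assumptions for illustration, and the 1.5 × IQR rule is just one common convention for flagging candidate outliers.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data: two correlated attributes and one skewed attribute
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": 0.8 * a + rng.normal(scale=0.5, size=300),
    "c": rng.exponential(size=300),
})
toy.loc[0, "a"] = 15.0  # injected extreme value

# Outliers: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
is_outlier = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)
print("Flagged values per attribute:")
print(is_outlier.sum())

# Correlation: pairwise Pearson correlation between attributes
corr = toy.corr()
print(corr.round(2))

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
toy.boxplot(ax=axes[0])                      # extreme values
axes[0].set_title("Box plots")
toy["c"].hist(bins=30, ax=axes[1])           # distribution of one attribute
axes[1].set_title("Histogram of c")
im = axes[2].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[2].set_xticks(range(len(corr)))
axes[2].set_xticklabels(corr.columns)
axes[2].set_yticks(range(len(corr)))
axes[2].set_yticklabels(corr.index)
axes[2].set_title("Correlation matrix")
axes[2].grid(False)
fig.colorbar(im, ax=axes[2])
fig.tight_layout()
```

A flagged value is only a candidate: whether it is a measurement error or a legitimate extreme observation has to be judged from the context of the data.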

There are three aspects that need to be addressed when you carry out the PCA for the report:

• The principal directions of the considered PCA components. Plot and interpret the components in terms of the attributes.  

• The amount of variance explained as a function of the number of PCA components included.  

• The data projected onto the considered principal components, e.g. in 2D scatter plots (hint: it may be helpful to color code the points according to the value of the attribute you wish to predict).  

Hint: If your attributes have very different scales, it may be helpful to standardize the data prior to the PCA.
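
The three PCA aspects can be sketched with scikit-learn as shown here. Everything about the data is an assumption made for illustration: `X_toy`, `label` and `attribute_names` are synthetic stand-ins for your own attribute matrix, target and column names.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: attributes on very different scales, plus a class label
rng = np.random.default_rng(2)
n = 200
label = rng.integers(0, 2, size=n)
base = rng.normal(size=n) + 2.0 * label
X_toy = np.column_stack([
    base + rng.normal(scale=0.3, size=n),             # unit scale
    1000.0 * base + rng.normal(scale=300.0, size=n),  # much larger scale
    rng.normal(size=n),                               # unrelated noise
])
attribute_names = ["x1", "x2", "x3"]

# Standardize so that no attribute dominates purely because of its units
X_std = StandardScaler().fit_transform(X_toy)

pca = PCA()
Z = pca.fit_transform(X_std)   # data projected onto the principal components
V = pca.components_            # each row is one principal direction
cum_var = np.cumsum(pca.explained_variance_ratio_)

# 1) Principal directions expressed in terms of the attributes
for i, direction in enumerate(V, start=1):
    print(f"PC{i}:", dict(zip(attribute_names, np.round(direction, 2).tolist())))

# 2) Variance explained as a function of the number of components
print("Cumulative variance explained:", np.round(cum_var, 3))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(range(1, len(cum_var) + 1), cum_var, "o-")
axes[0].set_xlabel("Number of components")
axes[0].set_ylabel("Cumulative variance explained")

# 3) Projection onto the first two components, colored by the target
axes[1].scatter(Z[:, 0], Z[:, 1], c=label, cmap="coolwarm", s=15)
axes[1].set_xlabel("PC1")
axes[1].set_ylabel("PC2")
fig.tight_layout()
```

The sign of each principal direction is arbitrary, so interpret the relative weights of the attributes within a component rather than whether they are positive or negative.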