# CDS Onboarding: How To Use Data

This Jupyter notebook contains the code associated with "How To Use Data in Data Science", a talk given by Alexander Wang to CDS recruits in Fall 2021. The goal of this notebook is to demonstrate:
* How to read raw data from .csv files into Pandas DataFrames,
* How to prepare data for use in machine learning models from a library (Scikit-Learn, PyTorch),
* And how to run machine learning models using the prepared data.

Now let's get started! First, we need to import the libraries used throughout this notebook.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
print("Successfully completed imports!")

Next, we read the Iris flowers dataset from a .csv file located at the local path `"datasets/iris.csv"`.

In [None]:
raw_data = pd.read_csv('datasets/iris.csv')

# Ending a cell with a bare DataFrame displays it; for a long DataFrame,
# the notebook shows only the first and last rows
raw_data
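
Before preparing the data, it often helps to check its shape, column types, and label distribution. The sketch below illustrates this on a small inline sample rather than the real file. The four rows and the label strings such as `Iris-setosa` are hypothetical stand-ins; only the column names are taken from this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for datasets/iris.csv: same column names, only four rows.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# pd.read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                    # (number of rows, number of columns)
print(sample.dtypes)                   # inferred type of each column
print(sample.head(2))                  # first two rows
print(sample["class"].value_counts())  # number of rows per label
```

The same calls should work unchanged on `raw_data` once the real file has been loaded.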

Next, we standardize the feature columns of the DataFrame so that each one has zero mean and unit variance.

In [None]:
# Extract the four feature columns and the class labels as two NumPy ndarrays
raw_X = raw_data[["sepal_length","sepal_width","petal_length","petal_width"]].to_numpy()
y = raw_data["class"].to_numpy()

# Learn each column's mean and standard deviation, then rescale the columns
scaler = StandardScaler().fit(raw_X)
standardized_X = scaler.transform(raw_X)

# Print the head of the array
standardized_X[0:5]
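
To make the standardization step concrete, here is a minimal sketch comparing `StandardScaler` with the equivalent manual computation: subtract each column's mean and divide by its standard deviation. The small matrix `X` is made-up illustration data, not part of the Iris dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix for illustration: 4 samples, 2 features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Manual standardization, column by column. NumPy's default ddof=0
# (population standard deviation) matches what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # do the two approaches agree?
print(scaled.mean(axis=0))          # per-column mean after scaling
print(scaled.std(axis=0))           # per-column standard deviation after scaling
```

Standardizing puts features measured on different scales on an equal footing, which tends to help models such as logistic regression converge.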

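Next, we split the dataset into training and testing sets. Before the split, we shuffle the features and labels in parallel, applying the same random permutation to both so that every sample stays paired with its label. Note that `train_test_split` also shuffles by default, so the explicit shuffle here mainly illustrates the technique.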

In [None]:
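# Features and labels must contain the same number of samples
assert len(standardized_X) == len(y)

# Apply one random permutation to both arrays so each row keeps its label
permutation = np.random.permutation(len(y))
standardized_X = standardized_X[permutation]
y = y[permutation]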

# Hold out 40% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(standardized_X, y, test_size=0.4)

print(f"Length of X_train: {len(X_train)}; Length of X_test: {len(X_test)}; Length of y_train: {len(y_train)}; Length of y_test: {len(y_test)}")
y_test
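
Because the split is random, the rows that land in each set change from run to run. The sketch below shows one possible variation, not the approach used in this notebook: it fixes a seed for reproducibility, stratifies by label so each class keeps its proportion in both sets, and fits the scaler on the training portion only, a common way to keep test-set statistics out of preprocessing. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for `datasets/iris.csv`, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for datasets/iris.csv.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.4,
    random_state=0,  # arbitrary seed so the split is repeatable
    stratify=y,      # keep class proportions equal in both sets
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
print(np.bincount(y_train), np.bincount(y_test))  # samples per class in each set
```

With this ordering the test set is scaled using the training set's mean and standard deviation, so its columns will not have exactly zero mean.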

Finally, we use `X_train` and `y_train` to train a machine learning model: scikit-learn's implementation of the logistic regression classifier. A full tour of scikit-learn is beyond the scope of this session, but the same steps apply to most other models in the library. Other machine learning libraries follow analogous steps, using their own functions in place of scikit-learn's.

We will also evaluate the accuracy of the trained model on `X_test` and `y_test`. Because the split is random, the exact value varies from run to run, but it should typically exceed 0.90.

In [None]:
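# Create the classifier with default hyperparameters and fit it on the training set
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predict labels for the held-out test samples
y_pred = logreg.predict(X_test)

# score() returns the mean accuracy on the given samples and labels
print(f"Accuracy of LogisticRegression classifier on test set: {logreg.score(X_test, y_test):.2f}")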
#print(y_pred)
#print(y_test)
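
To illustrate how the same fit-and-predict steps carry over to other scikit-learn models, the sketch below trains two classifiers with identical code and evaluates them with the `metrics` module imported at the top of this notebook. The choice of `KNeighborsClassifier` as the second model, its `n_neighbors=5` setting, the seed, and the use of `load_iris` in place of `datasets/iris.csv` are all assumptions made for illustration.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for datasets/iris.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

# Standardize using statistics from the training set only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Both estimators expose the same fit() / predict() interface.
models = {
    "LogisticRegression": LogisticRegression(),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = metrics.accuracy_score(y_test, y_pred)
    print(f"{name}: accuracy = {results[name]:.2f}")
    # Rows are true classes, columns are predicted classes.
    print(metrics.confusion_matrix(y_test, y_pred))
```

The confusion matrix shows which classes are mistaken for one another, which a single accuracy number cannot reveal.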