# Importing packages and dataset

1. Import the required packages
2. Upload the dataset to Google Drive
3. Mount Google Drive in Colab
4. Read the file with the pandas `read_csv()` method
5. Display the first five rows (head) of the dataset

NOTE -
1. `pd` is an alias for the pandas package, which makes its functions shorter to call
2. `read_csv()` reads a comma-separated values file into a DataFrame
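Steps 3 and 4 can be sketched roughly as follows. This is only an illustration: `mount_drive_if_colab` is a hypothetical helper that does nothing outside Colab, and the two-row sample (copied from the preview of the dataset) stands in for the real `crx.data` file.

```python
import io
import pandas as pd

def mount_drive_if_colab(mount_point="/content/drive"):
    """Hypothetical helper: mount Google Drive when running inside Colab."""
    try:
        from google.colab import drive  # importable only inside Colab
    except ImportError:
        return False  # not running in Colab, nothing to mount
    drive.mount(mount_point)
    return True

# Two rows in the same comma-separated layout as the dataset (no header row).
sample_text = (
    "b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+\n"
)

# header=None tells pandas that the first line is data, so the columns
# are simply numbered 0..15.
sample = pd.read_csv(io.StringIO(sample_text), header=None)
print(sample.shape)
print(list(sample.columns))
```

In Colab, `mount_drive_if_colab()` would be called once before reading from a path under `/content/drive`.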

In [None]:
# Import statements
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression, RidgeClassifier

# Load the dataset into a DataFrame (the file has no header row)
dataset_file = pd.read_csv('/content/drive/My Drive/crx.data', header=None)

# Inspect data
dataset_file.head()

   0      1      2  3  4  5  6     7  8  9  10  11  12   13   14  15
0  b  30.83    0.0  u  g  w  v  1.25  t  t   1   f   g  202    0   +
1  a  58.67   4.46  u  g  q  h  3.04  t  t   6   f   g   43  560   +
2  a   24.5    0.5  u  g  q  h   1.5  t  f   0   f   g  280  824   +
3  b  27.83   1.54  u  g  w  v  3.75  t  t   5   t   g  100    3   +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0   f   s  120    0   +


# Custom Functions 

Helper functions to reduce redundancy in the later cells

In [None]:
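# Extras

def printblock(out):
  """Print `out` between two separator lines.

  Example: printblock("A word or a sentence")
  """
  print("----------------------------------------------------------------")
  print(out)
  print("----------------------------------------------------------------")

def accuracy(obj):
  """Print the test-set score of a fitted model as a percentage.

  Relies on the globals x_test_rescaled and y_test, which are defined
  in later cells, so call it only after the data has been split and scaled.
  """
  print(round(obj.score(x_test_rescaled,y_test)*100,4),"%\n")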

# Understanding the dataset

1. The attribute names and values in the dataset have been replaced with meaningless symbols to protect the confidentiality of the data
2. I referred to this [link](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html) to better understand the possible attribute values
3. From it, these appear to be the attributes -
Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and ApprovalStatus.
4. Let us now take a look at the dataset and some information about it.
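As a small illustration, the sixteen names could be attached to a DataFrame as column labels. This is a sketch only: the names are an interpretation taken from the linked write-up rather than part of the file, and the single row is copied from the dataset preview. The rest of the notebook keeps the numeric column labels.

```python
import pandas as pd

# Probable attribute names, in column order (an interpretation, not file metadata).
column_names = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# One row from the dataset preview, used as a stand-in for the full file.
row = ["b", "30.83", 0.0, "u", "g", "w", "v", 1.25,
       "t", "t", 1, "f", "g", "202", 0, "+"]

labelled = pd.DataFrame([row], columns=column_names)
print(labelled[["Age", "Income", "ApprovalStatus"]])
```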

In [None]:
# Summary statistics for the numeric columns
printblock("DATASET - SUMMARY")
print(dataset_file.describe(),"\n")

# Dataframe information (info() prints its report directly and returns None)
printblock("DATAFRAME - INFO")
dataset_file.info()

----------------------------------------------------------------
DATASET - SUMMARY
----------------------------------------------------------------
               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000 

----------------------------------------------------------------
DATAFRAME - INFO
----------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non

# Missing Values Handling

1. The file contains '?' characters, which mark missing data
2. So that pandas treats them as missing, we will replace them with NaN values

NOTE - NaN stands for "Not a Number"
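A minimal sketch of the idea on a made-up three-row frame (the `toy` and `alt` names and their values are illustrative only). It also shows an alternative, the `na_values` parameter of `read_csv()`, which marks the placeholder as missing while the file is being read.

```python
import io
import numpy as np
import pandas as pd

# Toy frame with '?' placeholders in two text columns.
toy = pd.DataFrame({
    0: ["a", "?", "b"],
    1: ["30.83", "?", "24.5"],
    2: [0.0, 4.46, 0.5],
})

# Replace the placeholder with NaN so pandas can count and fill it.
toy = toy.replace("?", np.nan)
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Alternative: let read_csv treat '?' as missing at load time.
alt = pd.read_csv(io.StringIO("a,1.5\n?,2.0\n"), header=None, na_values="?")
print(alt.isnull().sum())
```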

In [None]:
# Replace the '?' placeholders with NaN
dataset_file = dataset_file.replace('?',np.nan)

# Fill NaNs in the numeric columns with the column mean
dataset_file.fillna(dataset_file.mean(numeric_only=True),inplace = True)

# Count the NaNs that remain (object columns are not touched by the mean fill)
printblock("NaN count")
dataset_file.isnull().sum()

----------------------------------------------------------------
NaN count
----------------------------------------------------------------


0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

In [None]:
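# Loop through the columns and fill the remaining NaNs
for col in dataset_file.columns:
  if dataset_file[col].dtypes == 'object':
    # NOTE: fillna is called on the whole frame, so every remaining NaN (in any
    # column) receives the most frequent value of the first object column reached
    dataset_file = dataset_file.fillna(dataset_file[col].value_counts().index[0])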
printblock("RECHECK - NaN's")
print(dataset_file.isnull().sum())

----------------------------------------------------------------
RECHECK - NaN's
----------------------------------------------------------------
0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64
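The loop above calls `fillna` on the whole frame, so all remaining gaps are filled with the most frequent value of the first object column it reaches. If each column should instead receive its own mean or most frequent value, a per-column variant could look like the sketch below. The `toy` frame and its column names are made up for illustration, and this is not the code that produced the outputs shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: one numeric column and two categorical columns with gaps.
toy = pd.DataFrame({
    "num": [1.0, np.nan, 3.0],
    "cat": ["u", "u", np.nan],
    "other": ["g", np.nan, "g"],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numeric column: fill with that column's own mean.
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # Categorical column: fill with that column's own most frequent value.
        most_frequent = toy[col].value_counts().index[0]
        toy[col] = toy[col].fillna(most_frequent)

print(toy.isnull().sum())
```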


# Preprocessing Data

In [None]:
# LabelEncoder() maps each distinct value in a column to an integer code
le = LabelEncoder()

# Encode every object (non-numeric) column; numeric columns are left as they are
for col in dataset_file.columns:
  if dataset_file[col].dtype == 'object' :
    dataset_file[col] = le.fit_transform(dataset_file[col])
dataset_file.head()

   0    1      2  3  4   5  6     7  8  9  10  11  12  13   14  15
0  1  156    0.0  2  1  13  8  1.25  1  1   1   0   0  68    0   0
1  0  328   4.46  2  1  11  4  3.04  1  1   6   0   0  11  560   0
2  0   89    0.5  2  1  11  4   1.5  1  0   0   0   0  96  824   0
3  1  125   1.54  2  1  13  8  3.75  1  1   5   1   0  31    3   0
4  1   43  5.625  2  1  13  8  1.71  1  0   0   0   2  37    0   0
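`LabelEncoder` assigns codes in sorted order of the distinct values, which is why `+` in the last column becomes 0. Reusing one encoder for every column overwrites its mapping each time; if the original labels need to be recovered later, one option is to keep a separate encoder per column. A minimal sketch, assuming a made-up `toy` frame and a hypothetical `encoders` dictionary:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "status": ["+", "-", "+"],
    "citizen": ["g", "s", "g"],
    "debt": [0.0, 4.46, 0.5],
})

# One fitted encoder per text column, so each mapping stays available.
encoders = {}
for col in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[col]):
        encoders[col] = LabelEncoder()
        toy[col] = encoders[col].fit_transform(toy[col])

# Codes follow the sorted order of the labels: '+' sorts before '-'.
print(toy["status"].tolist())

# The stored encoder can map the codes back to the original labels.
decoded = encoders["status"].inverse_transform(toy["status"])
print(list(decoded))
```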


# Test Train Split


In [None]:
# Convert the DataFrame to a NumPy array
dataset_file = dataset_file.values

# Split the array into two parts
  # Features (first 15 columns) - x
  # Labels (last column) - y
x,y = dataset_file[:,0:15], dataset_file[:,15]

# Print shape information
printblock("Dimensions info : ")
print("Dataset - dataset_file : ",str(dataset_file.shape),"\n")
print("Features - x :",str(x.shape).rjust(20),"\n")
print("Labels - y :",str(y.shape).rjust(19),"\n")

----------------------------------------------------------------
Dimensions info : 
----------------------------------------------------------------
Dataset - dataset_file :  (690, 16) 

Features - x :            (690, 15) 

Labels - y :              (690,) 



In [None]:
# Split into training (80%) and testing (20%) sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20,
                                                    random_state=20)
printblock("Train Test Splitting Info")
print("Total : 100%     - ",dataset_file.shape[0])
print("Training : 80%   - ",dataset_file.shape[0]*0.8)
print("Testing : 20%    - ",dataset_file.shape[0]*0.2)


----------------------------------------------------------------
Train Test Splitting Info
----------------------------------------------------------------
Total : 100%     -  690
Training : 80%   -  552.0
Testing : 20%    -  138.0


In [None]:
printblock("Training Dimensions")
print("Features : ",x_train.shape)
print("Labels : ",str(y_train.shape).rjust(8))
printblock("Testing Dimensions")
print("Features : ",x_test.shape)
print("Labels : ",str(y_test.shape).rjust(8))

----------------------------------------------------------------
Training Dimensions
----------------------------------------------------------------
Features :  (552, 15)
Labels :    (552,)
----------------------------------------------------------------
Testing Dimensions
----------------------------------------------------------------
Features :  (138, 15)
Labels :    (138,)


In [None]:
# Normalize the features to the range [0, 1] with MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
# Fit the scaler on the training data only, then apply the same scaling to the test data
x_train_rescaled = scaler.fit_transform(x_train)
x_test_rescaled = scaler.transform(x_test)
printblock("NOW the data is ready to be used in a Regression Model")

----------------------------------------------------------------
NOW the data is ready to be used in a Regression Model
----------------------------------------------------------------
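Fitting the scaler on the training split and reusing it on the test split keeps the test set's minimum and maximum out of preprocessing. A small sketch with made-up numbers (the `*_demo` arrays are illustrative only) shows the effect: test values outside the training range can land outside [0, 1], which is expected.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x_train_demo = np.array([[0.0], [5.0], [10.0]])
x_test_demo = np.array([[2.0], [12.0]])

scaler = MinMaxScaler(feature_range=(0, 1))

# The scaler learns min=0 and max=10 from the training data only.
train_scaled = scaler.fit_transform(x_train_demo)

# The same min and max are reused for the test data.
test_scaled = scaler.transform(x_test_demo)

print(train_scaled.ravel())
print(test_scaled.ravel())
```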


# Fitting the Data to a Regression Model

I will fit three linear models and compare their scores on the test set

Reference for the mathematics behind these models - [link](https://scikit-learn.org/stable/modules/linear_model.html)

1. Linear Regression Model
2. Logistic Regression Model
3. Ridge Classifier Model

In [None]:
# Create one instance of each model, then fit them on the training data
lin_reg = LinearRegression()
log_reg = LogisticRegression()
rig_cla = RidgeClassifier()
lin_reg.fit(x_train_rescaled,y_train)
log_reg.fit(x_train_rescaled,y_train)
rig_cla.fit(x_train_rescaled,y_train)

RidgeClassifier(alpha=2, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, random_state=None,
                solver='auto', tol=0.001)

# Predictions and Performance

The performance can be changed by tweaking the hyperparameters passed when creating each model (for example `alpha` for `RidgeClassifier`)
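One common way to explore such settings is a cross-validated grid search. The sketch below is an assumption-laden illustration, not part of the original workflow: it uses synthetic data from `make_classification` instead of the credit dataset, and the grid of `C` values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with 15 features, like the feature matrix in this notebook.
X_demo, y_demo = make_classification(n_samples=200, n_features=15,
                                     random_state=20)

# Candidate values for the inverse regularization strength C (arbitrary choice).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation over the grid, scored by accuracy (the default
# scoring for classifiers).
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

With the real data, `x_train_rescaled` and `y_train` would take the place of `X_demo` and `y_demo`, so that the test set stays untouched during tuning.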

In [None]:
# linear (LinearRegression.score() returns R^2, not classification accuracy)
pred_y1 = lin_reg.predict(x_test_rescaled)
printblock("Linear Regression Accuracy ")
accuracy(lin_reg)

# logistic
pred_y2 = log_reg.predict(x_test_rescaled)
printblock("Logistic Regression Accuracy ")
accuracy(log_reg)

# Ridge
pred_y3 = rig_cla.predict(x_test_rescaled)
printblock("Ridge Regression and Classifier Accuracy ")
accuracy(rig_cla)

----------------------------------------------------------------
Linear Regression Accuracy 
----------------------------------------------------------------
60.1213 %

----------------------------------------------------------------
Logistic Regression Accuracy 
----------------------------------------------------------------
86.2319 %

----------------------------------------------------------------
Ridge Regression and Classifier Accuracy 
----------------------------------------------------------------
87.6812 %
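When reading these figures, note that `score()` means different things for different estimators: for `LinearRegression` it is the coefficient of determination (R²), while for `LogisticRegression` and `RidgeClassifier` it is mean accuracy. The sketch below checks this on synthetic data; the 0.5 threshold used to turn regression outputs into class labels is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

# Synthetic binary classification data (labels are 0 and 1).
X_demo, y_demo = make_classification(n_samples=200, n_features=15,
                                     random_state=20)

lin = LinearRegression().fit(X_demo, y_demo)
log = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# score() is R^2 for the regressor and mean accuracy for the classifier.
lin_r2 = lin.score(X_demo, y_demo)
log_acc = log.score(X_demo, y_demo)

# To get an accuracy for the regressor, its continuous outputs must first be
# turned into labels; 0.5 is an assumed cut-off.
lin_labels = (lin.predict(X_demo) >= 0.5).astype(int)
lin_acc = accuracy_score(y_demo, lin_labels)

print(round(lin_r2, 4), round(lin_acc, 4), round(log_acc, 4))
```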

