**1. Feature Reduction**

In the first part of this exercise, you will use the feature_selection.csv dataset. You will fit a logistic regression classifier to this dataset, holding out 33% of the data for testing, and experiment with different feature selection and dimensionality reduction techniques.

- Fit the data without using any feature reduction
- Compute a correlation matrix and drop features that are over 95% correlated
- Reduce the number of features using Truncated Singular Value Decomposition (TSVD) at different values of the number of components (2, 5, 10, and 20)
- Report the results in a table with the following form

<img src="files/Images/ex8-1.jpg">

In [34]:
# Load libraries

import numpy as np
import pandas as pd
import warnings
import json

from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.exceptions import DataConversionWarning

# Ignore data conversion warnings and future warnings
warnings.simplefilter(action='ignore', category=DataConversionWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
# Read in the data and examine it

feature_selection = pd.read_csv('data/feature_selection.txt', sep = ',', header=0)
feature_selection.info()
feature_selection.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 818 entries, 0 to 817
Columns: 453 entries, ID to J.c9.5
dtypes: float64(450), int64(2), object(1)
memory usage: 2.8+ MB


| | ID | age | class | A.c1.1 | A.c1.2 | A.c1.3 | A.c1.4 | A.c1.5 | A.c2.1 | A.c2.2 | ... | J.c8.1 | J.c8.2 | J.c8.3 | J.c8.4 | J.c8.5 | J.c9.1 | J.c9.2 | J.c9.3 | J.c9.4 | J.c9.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5560 | 3 | typ | 2.238984 | 3.238984 | 4.238984 | 5.238984 | 6.238984 | 2.539386 | 3.539386 | ... | 2.33063 | 3.33063 | 4.33063 | 5.33063 | 6.33063 | 0.105146 | 1.105146 | 2.105146 | 3.105146 | 4.105146 |
| 1 | 4694 | 3 | typ | 1.490947 | 2.490947 | 3.490947 | 4.490947 | 5.490947 | 0.692924 | 1.692924 | ... | 0.033946 | 1.033946 | 2.033946 | 3.033946 | 4.033946 | -0.921489 | 0.078511 | 1.078511 | 2.078511 | 3.078511 |
| 2 | 6449 | 3 | typ | 1.828413 | 2.828413 | 3.828413 | 4.828413 | 5.828413 | 2.995978 | 3.995978 | ... | -0.309544 | 0.690456 | 1.690456 | 2.690456 | 3.690456 | 1.838188 | 2.838188 | 3.838188 | 4.838188 | 5.838188 |
| 3 | 3008 | 3 | asd | 1.930039 | 2.930039 | 3.930039 | 4.930039 | 5.930039 | 2.698195 | 3.698195 | ... | 0.727438 | 1.727438 | 2.727438 | 3.727438 | 4.727438 | 2.793029 | 5.793029 | 10.793029 | 17.793029 | 26.793029 |
| 4 | 3863 | 3 | typ | 2.272464 | 3.272464 | 4.272464 | 5.272464 | 6.272464 | 1.539144 | 2.539144 | ... | 2.168858 | 3.168858 | 4.168858 | 5.168858 | 6.168858 | -0.938 | 0.062 | 1.062 | 2.062 | 3.062 |


There is one categorical variable, `class`, which needs to be converted to a numeric encoding for model fitting.

In [3]:
# Checking distinct categorical values to create mapping and validate

feature_selection.groupby(['class']).size()

class
asd    180
typ    638
dtype: int64

In [4]:
# Create mapping for categorical variable and replace them in the data

map_class = {'typ': 0, 'asd': 1}
feature_selection['class'] = feature_selection['class'].map(map_class)

# Check the distinct encoded values to validate the mapping

feature_selection.groupby(['class']).size()

class
0    638
1    180
dtype: int64

In [5]:
# Check whether the data contains any null values
# If it does, we need to treat them separately

feature_selection.isnull().values.any()

False

There are no `null` values in the data frame, so no further segregation is needed and processing can continue. From here on, the categorical dependent variable for the `logistic regression` *(Y)* is the `class` variable, and the independent variables *(X)* are the rest of the fields in the `feature_selection` data frame.

In [6]:
# Define X and Y

category = 'class'
X = feature_selection.drop([category], axis=1)
Y = feature_selection.loc[:, category]


def split_test_train(X, Y, size):
    
    """
    Split into train and test sets, holding out the specified share as test data
    Args: X (independent variable(s)), Y (dependent variable(s)), size (share to hold out)
    Returns: scaled train and test features, train and test targets
    """

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=size)

    # Scale each feature to zero mean and unit standard deviation.
    # The scaler is fitted on the training data only and then applied
    # to the test data, so no test-set statistics leak into training

    X_sc = StandardScaler()
    X_train = X_sc.fit_transform(X_train)
    X_test = X_sc.transform(X_test)
    
    return X_train, X_test, Y_train, Y_test

X_train, X_test, Y_train, Y_test = split_test_train(X, Y, 0.33)

# Column count before any reduction (this count includes the `class` column)
feature_none = len(feature_selection.columns)
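
One way to keep the scaler from ever seeing test rows is to wrap it together with the classifier in a scikit-learn pipeline. The sketch below is only an illustration: it uses synthetic data from `make_classification` rather than the exercise dataset, and the `_demo`/`d`-prefixed names are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the exercise data (assumption: 200 rows, 20 features)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Hold out 33% for testing; stratify keeps the class ratio similar in both splits
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=0)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics when it scores the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(Xd_train, yd_train)

demo_accuracy = pipe.score(Xd_test, yd_test)
print('Held-out accuracy:', demo_accuracy)
```

Passing `random_state` to `train_test_split` also makes the split repeatable, which helps when comparing feature reduction techniques against each other.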

In [7]:
# Fit the logistic regression model on the training data

lr = LogisticRegression(random_state=0)
lr.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [8]:
# Get prediction

Y_pred = lr.predict(X_test)

# Summarise correct and incorrect predictions in a confusion matrix

conf_matrix = confusion_matrix(Y_test, Y_pred)

# Calculate accuracy

acc_none = lr.score(X_test, Y_test)

print('Confusion Matrix: ')
print(conf_matrix)
print('Model Accuracy Score: ', acc_none)

Confusion Matrix: 
[[209   0]
 [  0  61]]
Model Accuracy Score:  1.0
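
As a quick check, the accuracy score can be recovered from the confusion matrix itself: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total count. The small sketch below re-enters the matrix printed above by hand.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions
conf = np.array([[209, 0],
                 [0, 61]])

# Correct predictions sit on the diagonal
correct = np.trace(conf)
total = conf.sum()
accuracy = correct / total

print(correct, total, accuracy)  # 270 270 1.0
```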


In [9]:
# Generate classification report and calculate area under ROC

print('Classification Report:')
print(classification_report(Y_test, Y_pred))
print('Area under ROC:')
print(roc_auc_score(Y_test, Y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       209
           1       1.00      1.00      1.00        61

   micro avg       1.00      1.00      1.00       270
   macro avg       1.00      1.00      1.00       270
weighted avg       1.00      1.00      1.00       270

Area under ROC:
1.0
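
Note that `roc_auc_score` is usually given continuous scores rather than hard 0/1 predictions, because the ROC curve is built by sweeping a threshold over those scores. The sketch below compares the two on synthetic, deliberately noisy data (an assumption, not the exercise dataset); the `r`-suffixed variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with 20% label noise so the classes are not perfectly separable
Xr, yr = make_classification(n_samples=300, n_features=10, flip_y=0.2, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.33, random_state=0)

model = LogisticRegression(max_iter=1000).fit(Xr_train, yr_train)

# AUC computed from hard labels reflects a single decision threshold
auc_labels = roc_auc_score(yr_test, model.predict(Xr_test))

# AUC computed from the positive-class probability uses the full ranking
auc_scores = roc_auc_score(yr_test, model.predict_proba(Xr_test)[:, 1])

print('AUC from labels:       ', auc_labels)
print('AUC from probabilities:', auc_scores)
```

When the classifier is perfect on the test set, as in the output above, both variants give 1.0, so the difference only shows up on harder problems.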


In [10]:
# Remove highly correlated features (absolute correlation above 0.95)
# Generate correlation matrix

corr_matrix = feature_selection.corr().abs()

# Upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
feature_todrop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Number of features to drop

len(feature_todrop)

# Drop features

feature_selection = feature_selection.drop(feature_todrop, axis=1)
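
To see what the upper-triangle trick does, here is a minimal sketch on a three-column toy frame (the `toy` data is made up for illustration). Column `b` is an exact linear function of `a`, so it is the one that gets dropped; keeping only the upper triangle means each correlated pair is counted once and the earlier column survives.

```python
import numpy as np
import pandas as pd

# Toy data: b = 2 * a + 1, so a and b are perfectly correlated; c is unrelated
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                    'b': [3, 5, 7, 9, 11, 13],
                    'c': [3, 1, 4, 1, 5, 9]})

corr = toy.corr().abs()

# Keep only entries above the diagonal; everything else becomes NaN
mask = np.triu(np.ones(corr.shape), k=1).astype(bool)
upper = corr.where(mask)

# A column is dropped if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)                 # ['b']
print(list(reduced.columns))   # ['a', 'c']
```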

In [11]:
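# Split the data (after feature reduction) into train and test sets
# X and Y are rebuilt so that the dropped features are actually excluded

X = feature_selection.drop([category], axis=1)
Y = feature_selection.loc[:, category]
X_train, X_test, Y_train, Y_test = split_test_train(X, Y, 0.33)

# Fit the logistic regression model on the training data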

lr = LogisticRegression(random_state=0)
lr.fit(X_train, Y_train)

# Get prediction

Y_pred = lr.predict(X_test)

# Calculate accuracy

acc_corrdrop = lr.score(X_test, Y_test)

# Calculate number of features
feature_corrdrop = len(feature_selection.columns)

print('Model Accuracy Score: ', acc_corrdrop)

Model Accuracy Score:  1.0


In [31]:
def get_tsvd(df, component):
    """
    Fit logistic regression on TSVD-reduced features
    Args: dataframe (including the `class` column), number of components
    Returns: number of reduced features and test accuracy
    """
    
    # Separate the target so that it cannot leak into the components,
    # then split and scale the remaining features
    X_train, X_test, Y_train, Y_test = split_test_train(
        df.drop(['class'], axis=1), df['class'], 0.33)

    # Create a TSVD, fit it on the training data only,
    # then apply the same projection to the test data
    tsvd = TruncatedSVD(n_components=component)
    X_train_tsvd = tsvd.fit_transform(X_train)
    X_test_tsvd = tsvd.transform(X_test)
    
    feature_reduced = X_train_tsvd.shape[1]
    
    # Creating log model
    lr = LogisticRegression()
    
    # Fit model on the reduced features
    lr.fit(X_train_tsvd, Y_train)
    
    # Target prediction
    predict = lr.predict(X_test_tsvd)
    
    acc = accuracy_score(Y_test, predict)
        
    return feature_reduced, acc

In [32]:
# Run TSVD once per component count and reuse each result for both columns
tsvd_results = [get_tsvd(feature_selection, n) for n in (2, 5, 10, 20)]

output = pd.DataFrame({'Feature Reduction':['None', 
                                            'Drop Correlated Features',
                                            'TSVD (N=2)',
                                            'TSVD (N=5)',
                                            'TSVD (N=10)',
                                            'TSVD (N=20)'
                                           ],
                       '# of input features':[feature_none, 
                                              feature_corrdrop
                                             ] + [result[0] for result in tsvd_results],
                       'Accuracy':[acc_none, 
                                   acc_corrdrop
                                  ] + [result[1] for result in tsvd_results]
                      })

print(output)

          Feature Reduction  # of input features  Accuracy
0                      None                  453       1.0
1  Drop Correlated Features                  115       1.0
2                TSVD (N=2)                    2       1.0
3                TSVD (N=5)                    5       1.0
4               TSVD (N=10)                   10       1.0
5               TSVD (N=20)                   20       1.0
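
When choosing the number of TSVD components, it can help to look at how much of the total variance each setting retains. The sketch below is a rough illustration on synthetic data (an assumption, not the exercise dataset), using the `explained_variance_ratio_` attribute of `TruncatedSVD`; the names `Xv` and `retained` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 300 rows, 50 features, 10 of them informative
Xv, _ = make_classification(n_samples=300, n_features=50,
                            n_informative=10, random_state=0)

# Standardise first so every feature contributes on the same scale
Xv = StandardScaler().fit_transform(Xv)

retained = {}
for n in (2, 5, 10, 20):
    tsvd = TruncatedSVD(n_components=n, random_state=0)
    tsvd.fit(Xv)
    # Fraction of the total variance captured by the first n components
    retained[n] = tsvd.explained_variance_ratio_.sum()
    print(n, round(retained[n], 3))
```

A low retained fraction combined with high accuracy would suggest that the class signal is concentrated in a few directions of the feature space.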


**2. Dimensionality Reduction in Text Classification**

In an earlier lesson, you learned how to perform text classification. In this exercise, you will revisit the categorized-comments.jsonl dataset using dimensionality reduction techniques. For each dimensionality reduction technique, fit a logistic regression model to this dataset, holding 33% of the data out for testing. Perform the following dimensionality reduction techniques.

- Use the word count vector as input with no dimensionality reduction applied
- Use the TF IDF vector as input with no dimensionality reduction applied
- Use the word count vector as input with principal components applied for the listed values of N
- Use the TF IDF vector as input with truncated singular value decomposition applied for the listed values of N

<img src="files/Images/ex8-2.jpg">
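
The four variants can be sketched on a tiny made-up corpus before running them on the real comments. The documents, labels and `n_components=2` below are assumptions for illustration only. Note that `PCA` needs a dense array, so the count matrix is converted with `.toarray()`, which may not be feasible for a large vocabulary, whereas `TruncatedSVD` works on the sparse TF-IDF matrix directly.

```python
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up two-class corpus (assumption): 6 sports and 6 technology comments
docs = ['the team won the game last night', 'great goal in the final match',
        'the coach praised the players', 'our team lost the match',
        'the players trained before the game', 'a late goal won the final',
        'the new phone has a faster chip', 'software update fixes the bug',
        'the laptop battery lasts longer', 'researchers built a faster chip',
        'the update broke the software', 'a bug drained the phone battery']
labels = ['sports'] * 6 + ['tech'] * 6

train_docs, test_docs, y_train, y_test = train_test_split(
    docs, labels, test_size=0.33, stratify=labels, random_state=0)

# TF-IDF vectors reduced with truncated SVD (input stays sparse)
tfidf = TfidfVectorizer()
tsvd = TruncatedSVD(n_components=2, random_state=0)
X_train_tsvd = tsvd.fit_transform(tfidf.fit_transform(train_docs))
X_test_tsvd = tsvd.transform(tfidf.transform(test_docs))

# Word count vectors reduced with PCA (input must be dense)
counts = CountVectorizer()
pca = PCA(n_components=2, random_state=0)
X_train_pca = pca.fit_transform(counts.fit_transform(train_docs).toarray())
X_test_pca = pca.transform(counts.transform(test_docs).toarray())

# Fit one classifier per reduced representation and score on held-out docs
acc_tsvd = LogisticRegression().fit(X_train_tsvd, y_train).score(X_test_tsvd, y_test)
acc_pca = LogisticRegression().fit(X_train_pca, y_train).score(X_test_pca, y_test)
print('TF-IDF + TSVD accuracy:', acc_tsvd)
print('Counts + PCA accuracy: ', acc_pca)
```

In both cases the vectorizer and the reducer are fitted on the training documents only and then applied to the test documents.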

In [35]:
# Read the source data file for Categorized data
file = 'data/reddit/categorized-comments.jsonl'

data = []

with open(file) as f:
    for line in f:
        data.append(json.loads(line))
        
# Convert to Data Frame
category = pd.DataFrame(data)

category.head()

| | cat | txt |
|---|---|---|
| 0 | sports | Barely better than Gabbert? He was significant... |
| 1 | sports | Fuck the ducks and the Angels! But welcome to ... |
| 2 | sports | Should have drafted more WRs.\n\n- Matt Millen... |
| 3 | sports | [Done](https://i.imgur.com/2YZ90pm.jpg) |
| 4 | sports | No!! NOO!!!!! |
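
As an alternative to looping over the file by hand, pandas can parse JSON Lines directly with `read_json(..., lines=True)`. The sketch below uses a two-record in-memory string as a stand-in for the real file (an assumption), with the same `cat` and `txt` keys; `jsonl_text` and `demo` are hypothetical names.

```python
import io

import pandas as pd

# Two-record stand-in for the JSON Lines file (assumption: same 'cat'/'txt' keys)
jsonl_text = ('{"cat": "sports", "txt": "No!! NOO!!!!!"}\n'
              '{"cat": "news", "txt": "any this millennium?"}\n')

# lines=True tells pandas to parse one JSON object per line
demo = pd.read_json(io.StringIO(jsonl_text), lines=True)

print(demo.shape)  # (2, 2)
```

For the real dataset, the same call would take the file path in place of the `StringIO` object.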


In [36]:
# Check structure
# Check size of the total data
# Check categories
category.info()
print('Size: ', len(category))
print('Categories: ', category['cat'].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2347476 entries, 0 to 2347475
Data columns (total 2 columns):
cat    object
txt    object
dtypes: object(2)
memory usage: 17.9+ MB
Size:  2347476
Categories:  ['sports' 'science_and_technology' 'video_games' 'news']


In [37]:
# Since the dataset is very large, I will take a sample from each of the 4 categories.
# By trial, a sample of 1000 from each category can be handled easily by my machine
sample = category.groupby('cat').apply(lambda x :x.sample(1000))
del category
sample.head()

| cat | | cat | txt |
|---|---|---|---|
| news | 1418084 | news | I would beg to differ. |
| news | 2228151 | news | US spends so much, because it's the only milit... |
| news | 2122333 | news | But... are they just finding new customers all... |
| news | 1561279 | news | They renamed the emergency spillway the "auxil... |
| news | 2199842 | news | any this millennium? |
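
Because the sample is drawn at random, the rows shown above will differ on every run. A minimal sketch of a repeatable per-category sample is given below, assuming pandas 1.1 or later, where `DataFrameGroupBy.sample` is available; the `toy_comments` frame is made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the comments data (assumption): 5 rows per category
toy_comments = pd.DataFrame({
    'cat': ['news'] * 5 + ['sports'] * 5,
    'txt': [f'comment {i}' for i in range(10)],
})

# Draw the same number of rows from each category; random_state fixes the draw
balanced = toy_comments.groupby('cat').sample(n=2, random_state=0)

counts = balanced['cat'].value_counts().to_dict()
print(counts)
```

Unlike `groupby(...).apply(...)`, this keeps the original flat index rather than adding the category as an extra index level.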
