# LDA

We suggest keeping this notebook unchanged and working on a copy, so that you can refer back to the original whenever necessary.


# 1- First import data and become familiar with the data. 

*To do so we should:*

- Import the required libraries

- Import the data set and become comfortable with the data

In [4]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

# 2-Some Data explanation is in order here

In [5]:
categories = ['alt.atheism',
              'talk.religion.misc',
              'comp.graphics',
              'sci.space']

num_classes = len(categories)

In [6]:
train=fetch_20newsgroups(subset='train', categories=categories,shuffle=True)
test=fetch_20newsgroups(subset='test', categories=categories,shuffle=True)

No handlers could be found for logger "sklearn.datasets.twenty_newsgroups"


In [7]:
train_data=dict()
test_data=dict()

train_data['target'] = train.target
test_data['target'] = test.target

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,stop_words='english')
train_data['data'] = vectorizer.fit_transform(train.data)
test_data['data'] = vectorizer.transform(test.data)

# The with-statement closes each file on exit, so no explicit close() is needed.
with open('train.pkl','wb') as f0:
    pickle.dump(train_data,f0)
with open('test.pkl','wb') as f0:
    pickle.dump(test_data,f0)
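
The saved files can be read back in a later session with `pickle.load`. The following is a minimal round-trip sketch, assuming a small stand-in dictionary (`toy_data`) and a temporary file rather than the real `train.pkl`; both names are hypothetical and only illustrate the pattern.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for train_data; the notebook stores a sparse TF-IDF
# matrix under 'data' and an integer label array under 'target'.
toy_data = {'data': [[0.0, 0.5], [0.7, 0.0]], 'target': [0, 1]}

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'toy.pkl')

# Write in binary mode; the file is closed when the with-block exits.
with open(path, 'wb') as f:
    pickle.dump(toy_data, f)

# Read it back, again in binary mode.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == toy_data)
```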


# 3-Data Dimension

In [8]:
train_data['data'].shape

(2034, 33809)

In [9]:
feature_dim = train_data['data'].shape[1]
feature_dim

33809

In [133]:
item_index = np.where(train_data['target']==0)

# 4-LDA

*If you start by applying LDA directly to the training data, you will run out of memory because the number of features is very large. Therefore, dimensionality reduction is necessary.*

- We use PCA to reduce the feature dimension.

- For now, set the reduction factor to .005 to keep the running time short until you are sure your code works. Then set it to .03 to see whether that improves model performance.

- Run the following cell to compute the reduced data set. We use the PCA model built into the scikit-learn library.

In [10]:
from sklearn.decomposition import PCA

pca = PCA(n_components=int(.03*feature_dim))
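# toarray() returns a plain ndarray; todense() returns an np.matrix,
# which newer scikit-learn versions reject.
train_dense = train_data['data'].toarray()
train_data_shrunk = pca.fit(train_dense).transform(train_dense)
test_data_shrunk = pca.transform(test_data['data'].toarray())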

deducted_feature_dim = train_data_shrunk.shape[1]
deducted_feature_dim

1014
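
To make the projection step concrete, here is a minimal NumPy sketch of the idea behind PCA on a small synthetic matrix. The data, the sizes, and the variable names are all assumptions made for illustration, and this is not scikit-learn's implementation. The point is that the mean and the principal directions are estimated on the training set only and then reused for the test set.

```python
import numpy as np

rng = np.random.RandomState(0)
X_train = rng.randn(50, 10)   # 50 samples, 10 features (synthetic)
X_test = rng.randn(20, 10)

n_components = 3

# Center with the training mean, then take the top right-singular vectors.
train_mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - train_mean, full_matrices=False)
components = Vt[:n_components]            # shape (n_components, n_features)

# Project both sets onto the same directions.
X_train_small = (X_train - train_mean).dot(components.T)
X_test_small = (X_test - train_mean).dot(components.T)

print(X_train_small.shape, X_test_small.shape)
```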

*In the following cell we compute the LDA parameters based on formulas (4.36), (4.37) and (4.38). The difference here is that we do not need the denominator in (4.38), since it is the same for every class and therefore does not change which class scores highest. Please make sure you understand this. We should compute the following parameters:*

- mu: The mean of the features in each class.

- Sigma: The covariance matrix of each class. Be aware that LDA assumes all classes share the same covariance matrix. Therefore, in practice we take the average of the per-class covariance matrices.

- Pi: The class prior.

- Sigma_ave: The average of covariance matrices of all classes.

- beta: The parameter given in (4.37)

- gamma: The parameter given in (4.36)
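
For reference, the quantities in this list can be written out as follows. The formulas are reconstructed from the code in this notebook rather than copied from the textbook, so treat their pairing with the equation numbers as an assumption. Here $x$ is a reduced feature vector, $c$ indexes the classes, $\pi_c$ is the class prior, and $\Sigma$ stands for `Sigma_ave`.

```latex
\begin{align}
\gamma_c &= -\tfrac{1}{2}\,\mu_c^{T}\,\Sigma^{-1}\mu_c + \log \pi_c \\
\beta_c &= \Sigma^{-1}\mu_c \\
p(y = c \mid x) &= \frac{\exp\left(\beta_c^{T}x + \gamma_c\right)}
                        {\sum_{c'} \exp\left(\beta_{c'}^{T}x + \gamma_{c'}\right)}
\end{align}
```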

In [83]:
def calculate_sigma_and_mu(train_data, train_data_shrunk, category):
    index = np.where(train_data['target'] == category)[0]
    vectors = train_data_shrunk[index]
    mean = vectors.mean(axis=0)
    covariance = np.cov(vectors, rowvar=False)
    return mean, covariance

In [137]:
def calculate_prior(train_data):
    _, y_t = np.unique(train_data['target'], return_inverse=True)
    return np.bincount(y_t) / float(len(train_data['target']))

In [144]:
mu = np.zeros([train_data_shrunk.shape[1],len(categories)], dtype=float)
Sigma = np.zeros([train_data_shrunk.shape[1],train_data_shrunk.shape[1],len(categories)])
Pi = np.zeros(len(categories),dtype=float)

for i in range(0,len(categories)):
    category_mu, category_sigma= calculate_sigma_and_mu(train_data, train_data_shrunk, i)
    mu[:,i] = category_mu
    Sigma[:,:,i] = category_sigma

Sigma_ave = Sigma.mean(axis=2)
Sigma_inv = np.linalg.inv(Sigma_ave)
prior = calculate_prior(train_data)
beta = Sigma_inv.dot(mu)
gamma = np.diag((-1.0/2)*mu.T.dot(Sigma_inv).dot(mu)) + np.log(prior)
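
The same quantities can be checked on a small synthetic problem. The sketch below is not the notebook's method: as an assumption, it swaps the plain average of the class covariances for a pooled estimate weighted by class size, a common alternative when classes are unbalanced, and it uses `np.linalg.solve` instead of an explicit inverse. All data and names are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic data: 3 classes of unequal size in 2 dimensions (hypothetical).
sizes = [30, 50, 20]
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.randn(n, 2) + c for n, c in zip(sizes, centers)])
y = np.repeat(np.arange(3), sizes)

n_classes, dim = len(sizes), X.shape[1]
mu = np.zeros((dim, n_classes))
pooled = np.zeros((dim, dim))
for c in range(n_classes):
    Xc = X[y == c]
    mu[:, c] = Xc.mean(axis=0)
    # np.cov divides by (n_c - 1); multiply back to get the scatter matrix.
    pooled += (len(Xc) - 1) * np.cov(Xc, rowvar=False)
pooled /= (len(X) - n_classes)

prior = np.bincount(y) / float(len(y))

# Solve Sigma * beta = mu instead of forming the inverse explicitly.
beta = np.linalg.solve(pooled, mu)                        # (dim, n_classes)
# Column-wise mu_c^T Sigma^{-1} mu_c, without building the full matrix product.
gamma = -0.5 * np.sum(mu * beta, axis=0) + np.log(prior)  # (n_classes,)

print(beta.shape, gamma.shape)
```

Summing `mu * beta` over rows gives only the diagonal terms that `np.diag(...)` would extract from the full product, so the off-diagonal entries are never computed.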

*In the following cell we are going to compute:*
- class_scores: The class score of each data point in the test data set. This is the numerator in (4.38).
- class_prediction: The predicted class, which is the class with the maximum score.
- accuracy_rate: The accuracy of our LDA implementation on the test set.

In [149]:
def linear(weight_matrix, vector, bias):
    return np.dot(weight_matrix.T, vector.T).T + bias

In [181]:
class_scores = np.exp(linear(beta, test_data_shrunk, gamma))
class_prediction = np.argmax((class_scores.T / class_scores.sum(axis=1)), axis=0)
accuracy_rate = (class_prediction == test_data['target']).mean()
accuracy_rate

0.89948263118994831
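
Because the exponential is monotonic and the denominator is shared by all classes, the predicted class can be read directly from the linear scores. The sketch below illustrates this on hypothetical scores that are deliberately large, so that a naive `np.exp` would overflow; it also shows the usual max-subtraction trick for the case where normalized probabilities are actually wanted. The array `scores` is an assumption standing in for the output of `linear(beta, test_data_shrunk, gamma)`.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical linear scores beta^T x + gamma for 5 points and 4 classes,
# scaled up so that exponentiating them directly would overflow.
scores = 1000.0 * rng.randn(5, 4)

# Prediction straight from the linear scores.
pred_linear = np.argmax(scores, axis=1)

# Stable softmax: subtract the row-wise max before exponentiating.
shifted = scores - scores.max(axis=1, keepdims=True)
probs = np.exp(shifted)
probs /= probs.sum(axis=1, keepdims=True)
pred_softmax = np.argmax(probs, axis=1)

print(np.array_equal(pred_linear, pred_softmax))
```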

# 5-Compare with Other Models

*In the following cell we compare LDA with another classifier built into scikit-learn. Note that the k-nearest-neighbors classifier used here is not a linear model.*

In [182]:
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=4)
neigh.fit(train_data_shrunk,train_data['target']) 
# Fraction of misclassified test points
(neigh.predict(test_data_shrunk) != test_data['target']).mean()


0.16260162601626016
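
The value above is the fraction of test points that were misclassified, whereas the LDA cell reports the fraction classified correctly. The two are only comparable after converting one to the other: an error rate of about 0.163 corresponds to an accuracy of about 0.837. A minimal sketch of the relationship, using hypothetical labels and predictions:

```python
import numpy as np

# Hypothetical labels and predictions, only to illustrate the two metrics.
y_true = np.array([0, 1, 2, 3, 1, 0, 2, 3])
y_pred = np.array([0, 1, 2, 1, 1, 0, 3, 3])

error_rate = (y_pred != y_true).mean()   # fraction misclassified
accuracy = (y_pred == y_true).mean()     # fraction correct

print(error_rate, accuracy)
```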