Analyzing data patterns and different classification methods

Table of Contents

Task Description
Feature Extraction and Selection
Classification Algorithm
- Code to save NB in pipeline [Figure 2]
- Code to save Logistic Regression in pipeline [Figure 3]
- Code to save Decision tree in pipeline [Figure 4]
- Code to save LinearSVC in pipeline [Figure 5]
- Pairplot [Figure 1]
Output
- Confusion Matrix of LinearSVC [Figure 6]
- Confusion Matrix of Decision Tree [Figure 7]
- Confusion Matrix of Naive Bayes [Figure 8]
- Confusion Matrix of Logistic Regression [Figure 9]
Accuracy Comparison
- Accuracy of all models Train V Test [Figure 10]
Reference

Task Description

The objective of this assignment was to install the scikit-learn library for Python, retrieve the language dataset from DSL, and perform language classification using four different machine learning algorithms. After training the models, we ran an accuracy test on every model in the pipeline to show which one had the highest accuracy.

Feature Extraction and Selection

Before training, we selected the features best suited for classification with the SelectKBest method using the chi-square score. To visualize the data we first tried PCA, but the vectorized data was sparse and PCA does not work properly with it, so we used TruncatedSVD to plot the relations in Figure 1. The plot helps us understand the labels and the relation between each feature and the others, which saves time when training. It also lets us visualize the number of labels present in the data, namely 14.
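As a rough illustration of the steps above, the following sketch vectorizes a few sentences, keeps the highest-scoring features under the chi-square test, and projects the result to two dimensions with TruncatedSVD. The toy sentences, the labels, and the values of `k` and `n_components` are assumptions made for illustration; they are not taken from the assignment code.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy stand-in for the DSL training data (hypothetical sentences and labels).
texts = [
    "the cat sat on the mat", "a dog ran in the park",
    "el gato duerme en la casa", "un perro corre en el parque",
    "le chat dort sur le lit", "un chien court dans le parc",
]
labels = ["en", "en", "es", "es", "fr", "fr"]

# Word counts; the result is a SciPy sparse matrix.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(texts)

# Keep the k features with the highest chi-square score against the labels.
selector = SelectKBest(chi2, k=10)  # k=10 is an assumed value
X_selected = selector.fit_transform(X_counts, labels)

# TruncatedSVD works on sparse input directly, without densifying it.
svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X_selected)

print(X_counts.shape, X_selected.shape, X_2d.shape)
```

The two columns of `X_2d` are the kind of low-dimensional coordinates that can be passed to a plotting library to draw a pairplot.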

Classification Algorithm

We were instructed to use four different classification algorithms: Naive Bayes, LinearSVC, Decision Tree, and Logistic Regression. We wrapped each model in a pipeline and stored the pipelines in a Python dictionary so that we could loop over them one after the other.
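The dictionary-of-pipelines idea can be sketched roughly as follows. The helper `build_pipeline`, the step names, the value of `k`, and the toy sentences are all hypothetical; the actual pipelines are the ones shown in Figures 2 to 5.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def build_pipeline(classifier):
    """Hypothetical helper: vectorizer, chi-square selection, then the model."""
    return Pipeline([
        ("vectorizer", CountVectorizer()),
        ("selector", SelectKBest(chi2, k=10)),  # k=10 is an assumed value
        ("classifier", classifier),
    ])


# One pipeline per algorithm, keyed by a display name.
models = {
    "Naive Bayes": build_pipeline(MultinomialNB()),
    "LinearSVC": build_pipeline(LinearSVC()),
    "Decision Tree": build_pipeline(DecisionTreeClassifier(random_state=0)),
    "Logistic Regression": build_pipeline(LogisticRegression(solver="newton-cg")),
}

# Toy stand-in for the training data (not the DSL dataset).
train_texts = [
    "the cat sat on the mat", "a dog ran in the park",
    "el gato duerme en la casa", "un perro corre en el parque",
    "le chat dort sur le lit", "un chien court dans le parc",
]
train_labels = ["en", "en", "es", "es", "fr", "fr"]

# Fit every pipeline in turn and collect its prediction for a new sentence.
predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(["the dog sat in the park"])[0]

print(predictions)
```

Keeping the pipelines in one dictionary means the training and evaluation code is written once and reused for every algorithm.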

Naive Bayes

We used multinomial Naive Bayes (NB) [3]. The pipeline consists of a CountVectorizer, a feature selection step, and the model itself, so the data is transformed before it reaches the classifier.

Code to save NB in pipeline [Figure 2]

Naive Bayes uses prior knowledge (class probabilities estimated from the training data) to classify the data into different labels. Since our features are counts and the data has more than two labels, we went with the Multinomial NB algorithm to perform classification.
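To make the role of the prior concrete, here is a minimal sketch that fits a Multinomial NB pipeline on toy data and reads back the class priors the model learned. The sentences, step names, and `k` are assumptions for illustration, not the assignment's settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for the training data (hypothetical).
train_texts = [
    "the cat sat on the mat", "a dog ran in the park",
    "el gato duerme en la casa", "un perro corre en el parque",
    "le chat dort sur le lit", "un chien court dans le parc",
]
train_labels = ["en", "en", "es", "es", "fr", "fr"]

nb_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("selector", SelectKBest(chi2, k=10)),  # assumed value of k
    ("classifier", MultinomialNB()),
])
nb_pipeline.fit(train_texts, train_labels)

# class_log_prior_ stores log P(class); exponentiate to get the priors.
nb = nb_pipeline.named_steps["classifier"]
priors = np.exp(nb.class_log_prior_)
print(dict(zip(nb.classes_, priors)))
```

With two sentences per language, each class receives the same prior, so the classification here is driven by the word counts alone.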

Logistic Regression

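Logistic regression [4] is a well-known classification algorithm. We experimented with it and found that 'newton-cg' was the best solver for this particular dataset. The sigmoid and tanh variants we tried did not give us the accuracy we were expecting, so we used 'newton-cg'. Note that sigmoid and tanh are activation functions rather than solver options of scikit-learn's LogisticRegression, whose solvers include 'newton-cg', 'lbfgs', 'liblinear', 'sag', and 'saga'.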

Code to save Logistic Regression in pipeline [Figure 3]
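A minimal sketch of a logistic regression pipeline with the 'newton-cg' solver might look like this. The feature selection step is left out for brevity, and the toy sentences and step names are assumptions rather than the code in Figure 3.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the training data (hypothetical).
train_texts = [
    "the cat sat on the mat", "a dog ran in the park",
    "el gato duerme en la casa", "un perro corre en el parque",
    "le chat dort sur le lit", "un chien court dans le parc",
]
train_labels = ["en", "en", "es", "es", "fr", "fr"]

lr_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(solver="newton-cg")),
])
lr_pipeline.fit(train_texts, train_labels)

# predict_proba returns one probability per class for each input sentence.
probabilities = lr_pipeline.predict_proba(["el perro duerme"])
print(dict(zip(lr_pipeline.classes_, probabilities[0])))
```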

Decision Tree Classifier

A Decision Tree [1] is commonly used for predictive analysis such as regression, but it can also perform classification. It is one of the most widely used algorithms in machine learning and is the building block of the random forest classifier.

Code to save Decision tree in pipeline [Figure 4]
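The sketch below fits a decision tree pipeline on toy data and then checks that a random forest is indeed built from individual decision trees. The sentences, step names, and `n_estimators=5` are illustrative assumptions; the assignment itself did not use a random forest.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the training data (hypothetical).
train_texts = [
    "the cat sat on the mat", "a dog ran in the park",
    "el gato duerme en la casa", "un perro corre en el parque",
    "le chat dort sur le lit", "un chien court dans le parc",
]
train_labels = ["en", "en", "es", "es", "fr", "fr"]

tree_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
tree_pipeline.fit(train_texts, train_labels)
tree_prediction = tree_pipeline.predict(["un chat dort"])[0]
print(tree_prediction)

# A random forest keeps its fitted trees in the estimators_ attribute.
forest_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=5, random_state=0)),
])
forest_pipeline.fit(train_texts, train_labels)
forest = forest_pipeline.named_steps["classifier"]
print(all(isinstance(t, DecisionTreeClassifier) for t in forest.estimators_))
```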

LinearSVC

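LinearSVC [2] uses the One-vs-All (also known as One-vs-Rest) multiclass reduction, as noted in the scikit-learn documentation, so it fits K models, where K is the number of classes. SVC, by contrast, uses a One-vs-One scheme for multiclass problems and fits K * (K - 1) / 2 models. With our 14 labels that is 14 models for LinearSVC versus 91 for SVC.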

Code to save LinearSVC in pipeline [Figure 5]
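The difference between the two multiclass schemes can be checked on a small synthetic dataset. This sketch uses `make_blobs` with four classes purely for illustration (it is not the language data), and `ovo_model_count` is a hypothetical helper for the K * (K - 1) / 2 formula.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, LinearSVC


def ovo_model_count(k):
    """Number of pairwise classifiers a One-vs-One scheme fits for k classes."""
    return k * (k - 1) // 2


# Synthetic 4-class data, used only to count the fitted models.
X, y = make_blobs(n_samples=80, centers=4, random_state=0)
n_classes = len(set(y))

# One-vs-Rest: LinearSVC learns one weight vector per class.
linear_svc = LinearSVC().fit(X, y)
ovr_models = linear_svc.coef_.shape[0]

# One-vs-One: SVC exposes one decision value per pair of classes.
svc = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
ovo_models = svc.decision_function(X).shape[1]

print(n_classes, ovr_models, ovo_models)
print(ovo_model_count(14))
```

For four classes the One-vs-Rest model holds 4 weight vectors while the One-vs-One model produces 6 pairwise decision values, matching the formula.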

Pairplot [Figure 1]

Output

The confusion matrix for LinearSVC

Confusion Matrix of LinearSVC [Figure 6]

The confusion matrix for Decision Tree

Confusion Matrix of Decision Tree [Figure 7]

The confusion matrix for Naive Bayes

Confusion Matrix of Naive Bayes [Figure 8]

The confusion matrix for Logistic Regression

Confusion Matrix of Logistic Regression [Figure 9]
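For readers unfamiliar with how such matrices are produced, a minimal sketch with scikit-learn's `confusion_matrix` follows. The true and predicted labels are made up for illustration and do not come from the models above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical gold labels and predictions for six test sentences.
y_true = ["en", "en", "es", "es", "fr", "fr"]
y_pred = ["en", "en", "es", "fr", "fr", "fr"]
label_order = ["en", "es", "fr"]

# Rows are true labels, columns are predicted labels, both in label_order.
cm = confusion_matrix(y_true, y_pred, labels=label_order)
print(cm)
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]

# The diagonal counts correct predictions, so accuracy is trace / total.
accuracy = cm.trace() / cm.sum()
print(accuracy)
```

Off-diagonal cells show which labels get confused with each other; here one "es" sentence was predicted as "fr".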

Accuracy Comparison

Accuracy of all models Train V Test [Figure 10]

We can clearly see that Naive Bayes had higher accuracy during training, but the model did not perform as well on unseen data. LinearSVC, on the other hand, performed consistently across train and test data.
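A train-versus-test comparison of this kind can be sketched as below. The helper `train_test_accuracy`, the toy sentences, and the choice of two models are assumptions for illustration; the numbers it prints say nothing about the results in Figure 10.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def train_test_accuracy(model, X_train, y_train, X_test, y_test):
    """Hypothetical helper: fit the model, then score it on both splits."""
    model.fit(X_train, y_train)
    return {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }


# Toy stand-ins for the train and test files (hypothetical).
train_texts = [
    "the cat sat on the mat", "a dog ran in the park",
    "el gato duerme en la casa", "un perro corre en el parque",
    "le chat dort sur le lit", "un chien court dans le parc",
]
train_labels = ["en", "en", "es", "es", "fr", "fr"]
test_texts = ["the dog sat on the mat", "el gato corre en el parque",
              "le chien dort sur le lit"]
test_labels = ["en", "es", "fr"]

results = {}
for name, clf in {"Naive Bayes": MultinomialNB(), "LinearSVC": LinearSVC()}.items():
    pipeline = Pipeline([("vectorizer", CountVectorizer()), ("classifier", clf)])
    results[name] = train_test_accuracy(
        pipeline, train_texts, train_labels, test_texts, test_labels
    )

for name, scores in results.items():
    # A large positive gap suggests the model fits the training data too closely.
    print(name, scores, "gap:", scores["train"] - scores["test"])
```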

Reference

[1] scikit-learn, "scikit-learn," [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier. [Accessed 29 07 2018].

[2] "svc," [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html. [Accessed 29 07 2018].

[3] "naivebayes," [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html. [Accessed 29 07 2018].

[4] "logreg," [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. [Accessed 29 07 2018].
