# Kaggle: Predict the topic of the text 

#### 4 subject data for text classification
Data and instructions can be found here: https://www.kaggle.com/deepak711/4-subject-data-text-classification/data

## 1. Import relevant packages

In [1]:
import pandas as pd
import glob
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

## 2. Extract and store the data

In [2]:
base_dir = "/Users/caroline/Documents/kaggle/train_data_final"
subjects = ["biology", "accounts", "geography", "physics"]

frames = []
for subject in subjects:
    texts = []
    # read every .txt file in the subject's folder
    for filename in glob.glob(f"{base_dir}/{subject}/*.txt"):
        with open(filename) as f:
            texts.append(f.read())
    # one row per file, labelled with its subject
    frames.append(pd.DataFrame({"text": texts, "subject": subject}))

data = pd.concat(frames, ignore_index=True)
data

| | text | subject |
|---|---|---|
| 0 | LOCOMOT ION AND MOVEMENT\n\ncalled locomotion.... | biology |
| 1 | Ernst Mayr\n(1904 — 2004)\n\nBorn on 5 July 19... | biology |
| 2 | TRANSPORT IN PLANTS\n\n- O="=O\nO==O\n- o==o\n... | biology |
| 3 | BIOLOGY\n\nNatural methods work on the princip... | biology |
| 4 | MORPHOLOGY OF FLOWERING PLANTS\n\ncoat is the ... | biology |
| ... | ... | ... |
| 1781 | LAWS OF MOTION\n\n \n\nThe total momentum of a... | physics |
| 1782 | 4O\n\n \n\n \n\ni Physics\n\nEXAMPLE 1.13\n\n ... | physics |
| 1783 | I Physics\n\n \n\n7. A changing current in a c... | physics |
| 1784 | 222\n\n \n\n \n\ni Physics\n\nNow, let us reco... | physics |


## 3. Data preprocessing

### a) Encode subjects as integers

In [3]:
le = preprocessing.LabelEncoder()
le.fit(data.subject)

data['subject_encoded'] = le.transform(data.subject) 
data  # le.classes_ lists the subjects in code order; data.subject_encoded.unique() lists the codes

| | text | subject | subject_encoded |
|---|---|---|---|
| 0 | LOCOMOT ION AND MOVEMENT\n\ncalled locomotion.... | biology | 1 |
| 1 | Ernst Mayr\n(1904 — 2004)\n\nBorn on 5 July 19... | biology | 1 |
| 2 | TRANSPORT IN PLANTS\n\n- O="=O\nO==O\n- o==o\n... | biology | 1 |
| 3 | BIOLOGY\n\nNatural methods work on the princip... | biology | 1 |
| 4 | MORPHOLOGY OF FLOWERING PLANTS\n\ncoat is the ... | biology | 1 |
| ... | ... | ... | ... |
| 1781 | LAWS OF MOTION\n\n \n\nThe total momentum of a... | physics | 3 |
| 1782 | 4O\n\n \n\n \n\ni Physics\n\nEXAMPLE 1.13\n\n ... | physics | 3 |
| 1783 | I Physics\n\n \n\n7. A changing current in a c... | physics | 3 |
| 1784 | 222\n\n \n\n \n\ni Physics\n\nNow, let us reco... | physics | 3 |
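`LabelEncoder` sorts the class names before assigning codes, which is why `biology` maps to 1 and `physics` to 3 in the table. The sketch below uses a hypothetical toy list in place of `data.subject` and reads the full mapping off `le.classes_`:

```python
from sklearn import preprocessing

# Toy stand-in for data.subject (hypothetical; same four subject names)
toy_subjects = ["biology", "accounts", "geography", "physics", "biology"]

le = preprocessing.LabelEncoder()
le.fit(toy_subjects)

# le.classes_ is sorted; the position of each class is its integer code
label_mapping = {str(name): code for code, name in enumerate(le.classes_)}
print(label_mapping)
```

Keeping such a mapping at hand makes the confusion matrix easier to read later, since its rows and columns follow the same code order.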


### b) Split the data into train and test

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    data.text, 
    data.subject_encoded, 
    test_size=0.2, 
    random_state=42)
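The split above is purely random. When classes differ a lot in size, one possible variation (not what this notebook does) is to pass `stratify=` so that both splits keep the class proportions. A minimal sketch on hypothetical toy labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy)  # keep the 80/20 class ratio in both splits

# Class counts in the train and test splits
print(np.bincount(y_tr), np.bincount(y_te))
```

In the notebook this would correspond to adding `stratify=data.subject_encoded`, which would change the split and therefore the scores reported below.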

### c) Create a tfidf model and fit the data

In [5]:
tfidf_model = TfidfVectorizer(
    ngram_range=(1, 2),       # unigrams and bigrams
    strip_accents='unicode',  # remove accents and apply other Unicode character normalization
    min_df=0.0001,            # ignore terms found in less than 0.01% of documents; a no-op below 10,000 documents
    lowercase=True)           # convert all characters to lowercase before tokenizing


transformed_train = tfidf_model.fit_transform(X_train)
transformed_test = tfidf_model.transform(X_test) 
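To see what these settings do to the features, the following sketch fits a vectorizer with the same `ngram_range`, `strip_accents` and `lowercase` options on two made-up sentences (not taken from the dataset) and lists the resulting vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical documents, used only for illustration
docs = ["The Café serves coffee", "Coffee beans"]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    strip_accents='unicode',
    lowercase=True)
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each unigram and bigram to its column index
print(sorted(vectorizer.vocabulary_))
# one row per document, one column per unigram or bigram
print(matrix.shape)
```

"Café" ends up as `cafe`, and each pair of adjacent words within a document becomes its own feature, which is why the feature matrix grows quickly on real text.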

## 4. Train model

In [6]:
# Support Vector Machine (SVM): Linear Support Vector Classification

parameters = {
    'C':[0.1, 0.3, 1, 3, 10],
    'loss': ['squared_hinge', 'hinge'],
    'class_weight': [None, 'balanced'],
    'tol': [1e-4, 1e-5, 1e-6]
             
             }

svm = LinearSVC(random_state = 42)

clf = GridSearchCV(svm, parameters)
clf.fit(transformed_train, y_train)



GridSearchCV(cv=None, error_score=nan,
             estimator=LinearSVC(C=1.0, class_weight=None, dual=True,
                                 fit_intercept=True, intercept_scaling=1,
                                 loss='squared_hinge', max_iter=1000,
                                 multi_class='ovr', penalty='l2',
                                 random_state=42, tol=0.0001, verbose=0),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.1, 0.3, 1, 3, 10],
                         'class_weight': [None, 'balanced'],
                         'loss': ['squared_hinge', 'hinge'],
                         'tol': [0.0001, 1e-05, 1e-06]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
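In the cell above, the TF-IDF model is fitted once on the whole training set, so during cross-validation the held-out folds have already contributed to the vocabulary and IDF weights. One common alternative, sketched here on a hypothetical toy corpus rather than the notebook's data, is to wrap both steps in a `Pipeline` so the vectorizer is refitted on each fold. The step names `tfidf` and `svm` and the reduced grid are choices made for this illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for X_train / y_train
texts = ["cell membrane protein", "plant cell tissue", "enzyme cell biology",
         "force mass acceleration", "momentum force energy", "velocity mass motion"] * 2
labels = [0, 0, 0, 1, 1, 1] * 2

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", LinearSVC(random_state=42)),
])

# Step parameters are addressed as <step name>__<parameter>
param_grid = {"svm__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)  # raw text goes in; vectorizing happens inside each fold

print(search.best_params_)  # best parameter combination found
print(search.best_score_)   # mean cross-validated accuracy of that combination
```

The same `best_params_` and `best_score_` attributes are available on the notebook's `clf` object after fitting.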

## 5. Make the prediction

In [14]:
y_pred = clf.predict(transformed_test)

## 6. Assess the results

### a) Calculate accuracy score

In [15]:
accuracy_score(y_test, y_pred)

0.9916201117318436

### b) Investigate failed predictions 

In [24]:
X_test.loc[y_pred != y_test.values]

938    (v) Colour — some minerals have\ncharacteristi...
67     ENVIRONMENTAL ISSUES\n\n \n\nGreenhouse gases\...
997    LIFE ON THE EARTH\n\nThe producers are consume...
Name: text, dtype: object

In [33]:
# check the labels of the samples where we failed
le.inverse_transform(y_test[y_pred != y_test.values])

array(['geography', 'biology', 'geography'], dtype=object)

In [35]:
# check the predictions of the samples where we failed
le.inverse_transform(y_pred[y_pred != y_test.values])

array(['physics', 'geography', 'biology'], dtype=object)
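The three lookups above can also be combined into one table, so each misclassified text sits next to its true and predicted subject. This is a sketch with hypothetical stand-ins for `X_test`, `y_test` and `y_pred`; only the masking pattern is taken from the notebook:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["accounts", "biology", "geography", "physics"])

# Hypothetical stand-ins for X_test, y_test and y_pred
X_test = pd.Series(["text a", "text b", "text c", "text d"], index=[938, 12, 67, 5])
y_test = pd.Series([2, 3, 1, 0], index=X_test.index)
y_pred = np.array([3, 3, 2, 0])

# True where the prediction differs from the label
mask = y_pred != y_test.values

errors = pd.DataFrame({
    "text": X_test[mask],
    "true": le.inverse_transform(y_test[mask]),
    "predicted": le.inverse_transform(y_pred[mask]),
})
print(errors)
```

The resulting frame keeps the original row index, so each error can be traced back to its row in `data`.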

In [17]:
# print the whole text
for text in X_test.loc[y_pred != y_test.values]:
    print(text)

(v) Colour — some minerals have
characteristic colour determined
by their molecular structure —
malachite. azurite. chalcopyrite etc..
and some minerals are coloured by
impurities. For example. because
of impurities quartz may be white.
green. red. yellow etc.

Streak — colour of the ground powder
of any mineral. It may be of the
same colour as the mineral or may
differ — malachite is green and gives
green streak. ﬂuorite is purple or
green but gives a white streak.
Transparency — transparent: light
rays pass through so that objects
can be seen plainly: translucent
— light rays pass through but will
get diffused so that objects cannot
be seen; opaque — light will not pass
at all.

Structure — particular arrange—
ment of the individual crystals;
fine. medium or coarse grained;
fibrous — separable. divergent.
radiating.

Hardness — relative resistance
being scratched: ten minerals are
selected to measure the degree of
hardness from 1—10. They are:
1_ talc; 2_ gypsum; 3_ calcite; origin. 

In [None]:
# columns correspond to the predictions
# rows correspond to the true labels

confusion_matrix(y_test, y_pred)

array([[ 55,   0,   0,   0],
       [  0, 128,   1,   0],
       [  0,   1,  18,   1],
       [  0,   0,   0, 154]])
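Per-class figures can be derived from this matrix directly. The sketch below copies the matrix from the output above and assumes the class order follows `LabelEncoder`'s alphabetical codes (accounts, biology, geography, physics), which matches the codes shown for `biology` and `physics` earlier:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[ 55,   0,   0,   0],
               [  0, 128,   1,   0],
               [  0,   1,  18,   1],
               [  0,   0,   0, 154]])

# Assumed class order: LabelEncoder's alphabetical codes 0..3
classes = ["accounts", "biology", "geography", "physics"]

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per true class (row sums)
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class (column sums)

for name, p, r in zip(classes, precision, recall):
    print(f"{name:10s} precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.4f}")
```

The diagonal sums to 355 of 358 test samples, which reproduces the accuracy score from 6a. Geography, the smallest class with 20 test samples, has the lowest recall (18 of 20).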