# Introduction
This notebook works through a tutorial on classifying unlabeled text with TF-IDF and K-Means clustering, based on:

https://towardsdatascience.com/applying-machine-learning-to-classify-an-unsupervised-text-document-e7bb6265f52

## Imports
Import libraries and write settings here.

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30
pd.options.display.float_format = '{:,.4f}'.format

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

from IPython import get_ipython
ipython = get_ipython()

# autoreload extension
if 'autoreload' not in ipython.extension_manager.loaded:
    %load_ext autoreload

%autoreload 2

# Visualizations
import seaborn as sns
#import plotly.plotly as py
#import plotly.graph_objs as go
#from plotly.offline import iplot, init_notebook_mode
#init_notebook_mode(connected=True)

import cufflinks as cf
cf.go_offline(connected=True)
cf.set_config_file(theme='white')



## Custom imports

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

# Analysis/Modeling
Do work here

Create a small corpus of sentences, each related to either *cricket* or *travelling*.

In [3]:
document = ["This is the most beautiful place in the world.",
            "This man has more skills to show in cricket than any other game.",
            "Hi there! how was your ladakh trip last month?",
            "There was a player who had scored 200 + runs in single cricket innings in his career.",
            "I have got the opportunity to travel to Paris next year for my internship.",
            "May be he is better than you in batting but you are much better than him in bowling.",
            "That was really a great day for me when I was there at Lavasa for the whole night.",
            "That’s exactly I wanted to become, a highest ratting batsmen ever with top scores.",
            "Does it really matter wether you go to Thailand or Goa, its just you have spend your holidays.",
            "Why don’t you go to Switzerland next year for your 25th Wedding anniversary?",
            "Travel is fatal to prejudice, bigotry, and narrow mindedness., and many of our people need it sorely on these accounts.",
            "Stop worrying about the potholes in the road and enjoy the journey.",
            "No cricket team in the world depends on one or two players. The team always plays to win.",
            "Cricket is a team game. If you want fame for yourself, go play an individual game.",
            "Because in the end, you won’t remember the time you spent working in the office or mowing your lawn. Climb that goddamn mountain.",
            "Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]

## Information extraction
To extract information from the document, we use a well-known method called **term frequency-inverse document frequency (TF-IDF)**. It estimates how important each word is to a document by converting the text into a *Vector Space Model (VSM)* representation, in which every sentence becomes a numeric vector.

**Note:** TF-IDF weighting is a classic information retrieval technique and has long been used, for example, by search engines to rank documents against a query.

### TF-IDF
TF\*IDF is an information retrieval technique that weighs a term’s frequency (TF) and its inverse document frequency (IDF). Each word or term has its respective TF and IDF score. The product of the TF and IDF scores of a term is called the TF*IDF weight of that term.

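The weight reflects how important a term is to a document: it grows with the number of times the term appears in that document and shrinks as the term appears in more documents of the corpus:

$$w_{i,j} = tf_{i,j} \times \log \left( \frac{N}{df_i} \right)$$

where $tf_{i,j}$ is the number of occurrences of term $i$ in document $j$, $df_i$ is the number of documents containing term $i$, and $N$ is the total number of documents.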
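To make the formula concrete, here is a minimal plain-Python sketch that applies it literally to a toy corpus. The function name `tf_idf`, the whitespace tokenizer, and the toy sentences are assumptions made for illustration. scikit-learn's `TfidfVectorizer` will not reproduce these numbers exactly: with its defaults (`smooth_idf=True`, `norm='l2'`) it uses the smoothed form $\ln\left(\frac{1+N}{1+df_i}\right) + 1$ for the IDF and L2-normalizes each row.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using w = tf * log(N / df)."""
    # Assumption: naive lowercase + whitespace tokenization, no stop-word removal
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df[t]: number of documents that contain term t at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus (hypothetical, not the notebook's data)
toy_docs = ["cricket is a team game", "travel is fun", "cricket team wins"]
toy_weights = tf_idf(toy_docs)
```

In this toy corpus, "game" occurs in only one of the three sentences, so it receives a higher weight than "cricket", which occurs in two.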

In [4]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(document)

In [5]:
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words='english', strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [6]:
X

<16x92 sparse matrix of type '<class 'numpy.float64'>'
	with 106 stored elements in Compressed Sparse Row format>

## Clustering
Once each sentence has been encoded as a numeric TF-IDF vector, the sentences can be clustered according to their content.

In this case we use a simple **K-Means** algorithm.
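For intuition about what K-Means does, the following is a minimal plain-Python sketch of the standard assignment/update loop (Lloyd's algorithm). It is not scikit-learn's implementation: the names `kmeans` and `squared_distance` are hypothetical, and the centroids are seeded with the first `k` points to keep the run deterministic, whereas the notebook relies on `init='k-means++'`.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    """Cluster a list of equal-length numeric tuples into k groups."""
    # Assumption: seed with the first k points (not k-means++ seeding)
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy 2-D points (hypothetical): two well-separated pairs
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
```

On these toy points the loop settles after a few iterations, with each pair in its own cluster and each centroid at the midpoint of its pair.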

In [7]:
# Set the number of clusters to output
true_k = 2

# Define the model with initialization
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)

# Fit the model
model.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
       n_clusters=2, n_init=1, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [8]:
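# For each cluster, sort the term indices by descending centroid weight
order_centroids = model.cluster_centers_.argsort()[:, ::-1]

# Get the vocabulary terms
# (in scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2)
terms = vectorizer.get_feature_names()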

In [9]:
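# Caution: order_centroids holds term indices ranked by weight, not per-term weights,
# so row k of df.T is rank k and the term shown as its label is only positional
df = pd.DataFrame(order_centroids, columns=terms, index=['Group_0', 'Group_1'])
df.T.head(10)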

| | Group_0 | Group_1 |
|---|---|---|
| 200 | 7 | 12 |
| 25th | 6 | 24 |
| accounts | 54 | 77 |
| anniversary | 73 | 91 |
| batsmen | 89 | 80 |
| batting | 81 | 69 |
| beautiful | 29 | 41 |
| better | 38 | 52 |
| bigotry | 44 | 51 |
| bowling | 13 | 34 |
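The values in the table above are term indices ordered by rank, so the row labels do not identify the terms they sit next to. For example, the first row holds indices 7 and 12, which in the alphabetically sorted vocabulary correspond to *better* and *cricket*. A possible way to show the top terms directly is sketched below in plain Python. The helper name `top_terms_per_cluster` and the toy values are assumptions for illustration.

```python
def top_terms_per_cluster(centroids, terms, n=10):
    """Map each cluster to its n highest-weighted terms, best first."""
    result = {}
    for i, row in enumerate(centroids):
        # Term indices sorted by descending centroid weight
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        result["Group_%d" % i] = [terms[j] for j in ranked[:n]]
    return result

# Toy centroids and vocabulary (hypothetical values, not the notebook's model)
example_terms = ["bat", "beach", "cricket", "trip"]
example_centroids = [[0.1, 0.7, 0.0, 0.5], [0.6, 0.0, 0.9, 0.1]]
toy_top = top_terms_per_cluster(example_centroids, example_terms, n=2)
```

With the notebook's objects, something like `pd.DataFrame(top_terms_per_cluster(model.cluster_centers_, terms))` should give one column per group and one row per rank.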


In [17]:
for i in range(true_k):
    print("\nCluster %d:" % i)
    for ind in order_centroids[i, :20]:
        print(' %s' % terms[ind])


Cluster 0:
 better
 beautiful
 place
 sport
 world
 trip
 hi
 ladakh
 month
 day
 night
 great
 lavasa
 team
 enjoy
 potholes
 road
 worrying
 stop
 journey

Cluster 1:
 cricket
 game
 team
 year
 travel
 skills
 man
 paris
 opportunity
 internship
 got
 25th
 anniversary
 wedding
 don
 switzerland
 highest
 scores
 batsmen
 ratting


# Results
Once the model is fitted, we can predict which group a new sentence belongs to by vectorizing it with the same fitted `vectorizer` and assigning it to the closest centroid.
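The "closest centroid" rule can be sketched in a few lines of plain Python. This is an illustration under the assumption of squared Euclidean distance, not the code path `model.predict` runs internally. The name `predict_cluster` and the toy vectors are hypothetical.

```python
def predict_cluster(vector, centroids):
    """Return the index of the centroid closest to vector (squared Euclidean distance)."""
    distances = [
        sum((v - c) ** 2 for v, c in zip(vector, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))

# Toy 3-term vocabulary space (hypothetical values)
toy_cluster_centers = [[0.0, 0.8, 0.1], [0.9, 0.0, 0.2]]
new_vector = [0.7, 0.1, 0.0]
predicted_cluster = predict_cluster(new_vector, toy_cluster_centers)
```

Here `new_vector` lies much closer to the second centroid, so it is assigned to cluster index 1.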

In [11]:
X = vectorizer.transform(["Nothing is easy in cricket. Maybe when you watch it on TV, it looks easy. But it is not. You have to use your brain and time the ball."])
predicted = model.predict(X)
print(predicted)

[1]


The prediction is `[1]`, meaning the sentence is assigned to cluster 1, whose top terms include *cricket*, *game* and *team*. The test sentence is about cricket, so the assignment matches. Note that the cluster numbering is arbitrary and, because `random_state` is not fixed, it may change between runs.

In [15]:
X = vectorizer.transform(["Baseball and cricket are different as a sport."])
predicted = model.predict(X)
print(predicted)

[0]


# Conclusions and Next Steps
Summarize findings here