Unstructured data, as the name suggests, does not have a structured format and may contain data such as dates, numbers or facts.

This results in irregularities and ambiguities which make it difficult to understand using traditional programs when compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.

Source: Wikipedia.

A few examples of unstructured data are:

Emails

Word Processing Files

PDF files

Spreadsheets

Digital Images

Video

Audio

Social Media Posts etc.

```
import csv

import pandas as pd

# Data loading: read the raw lines to count the messages
with open('dataset.csv') as f:
    raw_lines = [line.rstrip() for line in f]
print(len(raw_lines))

# Load the tab-separated file and attach column headers
messages = pd.read_csv('dataset.csv', sep='\t', quoting=csv.QUOTE_NONE, names=["label", "message"])

data_size = messages.shape
print(data_size)

messages_col_names = list(messages.columns)
print(messages_col_names)
print(messages.groupby('label').describe())
print(messages.head(3))

# Identifying the outcome/target variable
message_target = messages['label']
print(message_target)
```

#Tokenization 
Tokenization is a method to split a sentence/string into substrings. These substrings are called tokens.

In Natural Language Processing (NLP), tokenization is the initial step in preprocessing. Splitting a sentence into tokens helps to remove unwanted information in the raw text such as white spaces, line breaks and so on.
```
import nltk
nltk.download('all')  # downloads every NLTK resource; this can take a while
from nltk.tokenize import word_tokenize

def split_tokens(message):
    # In Python 3, str is already Unicode, so no explicit decoding is needed
    message = message.lower()
    word_tokens = word_tokenize(message)
    return word_tokens

messages['tokenized_message'] = messages.apply(lambda row: split_tokens(row['message']), axis=1)
```
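
To see what tokenization produces without downloading any NLTK data, here is a minimal sketch that uses only a regular expression. The function name `simple_tokenize` and the sample sentence are made up for illustration, and the rule it applies (keep runs of letters, digits and apostrophes) is an assumption; `word_tokenize` treats punctuation and contractions differently.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return runs of letters, digits or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# White space, line breaks and punctuation are dropped; only the tokens remain.
tokens = simple_tokenize("Free entry!\nCall  NOW to claim your prize.")
print(tokens)
```

Because the pattern only matches word-like characters, the extra spaces, the line break and the punctuation never reach the token list.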

![alt text](https://docs-cdn.fresco.me/system/attachments/files/000/169/041/large/9429cab82d30d31e799a67281382d83a42b2f9f8/lammatize.jpeg)
#Lemmatization
Lemmatization is a method to convert a word into its base/root form.

The lemmatizer removes the affixes of words that are present in its dictionary; words it does not recognize are left unchanged.

```
from nltk.stem.wordnet import WordNetLemmatizer
def split_into_lemmas(message):
    lemma = []
    lemmatizer = WordNetLemmatizer()
    for word in message:
        a=lemmatizer.lemmatize(word)
        lemma.append(a)
    return lemma
messages['lemmatized_message'] = messages.apply(lambda row: split_into_lemmas(row['tokenized_message']),axis=1)
print('Tokenized message:',messages['tokenized_message'][11])
print('Lemmatized message:',messages['lemmatized_message'][11])
```
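
The dictionary-based idea can be sketched with a toy lookup table. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the four entries stand in for a real lexical database such as WordNet; this is an illustration of the lookup principle, not how `WordNetLemmatizer` is implemented.

```python
# Hypothetical lookup table standing in for a real lexical database.
TOY_LEMMAS = {"cars": "car", "feet": "foot", "mice": "mouse", "messages": "message"}

def toy_lemmatize(tokens):
    """Return the base form for known words; pass unknown words through unchanged."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]

lemmas = toy_lemmatize(["mice", "sent", "messages", "txt"])
print(lemmas)
```

Words missing from the table ("sent", "txt") come back untouched, which is why the quality of a lemmatizer depends on the coverage of its dictionary.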

#Stop Word Removal
Stop words are common words that add little or no information for classification (e.g. “the”, “a”, “an”, “in”). Hence, these words are usually removed.
```
from nltk.corpus import stopwords
def stopword_removal(message):
    stop_words = set(stopwords.words('english'))
    # Keep only the non-stop words and join them back into a single string
    filtered_sentence = ' '.join([word for word in message if word not in stop_words])
    return filtered_sentence
messages['preprocessed_message'] = messages.apply(lambda row: stopword_removal(row['lemmatized_message']),axis=1)
Training_data=pd.Series(list(messages['preprocessed_message']))
Training_label=pd.Series(list(messages['label']))
```
![alt text](https://docs-cdn.fresco.me/system/attachments/files/000/167/877/large/0cef24abdee4953a24ac4f53af51b59aff55dccb/stop_words_removal.jpeg)

#Why Is Feature Extraction Important?
To perform machine learning on text documents, you first need to turn the text content into numerical feature vectors.

In Python, scikit-learn (sklearn) provides several feature extraction utilities for this.

We will be looking into a few specific ones used for unstructured data.


#Bag Of Words(BOW)

Bag of Words (BOW) is one of the most widely used methods for generating features in Natural Language Processing.

Representing/Transforming a text into a bag of words helps to identify various measures to characterize the text.

It is predominantly used for calculating the term (word) frequency, that is, the number of times a term occurs in a document/sentence.

It can be used as a feature for training the classifier.
![alt text](https://docs-cdn.fresco.me/system/attachments/files/000/173/375/large/ea3a2586a5fcfff917fb8bff373bb06f70f71f71/bags_of_words.jpeg)
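
As a minimal sketch of the idea, a bag of words can be built with the standard library alone. The helper name `bag_of_words` and the sample sentence are made up for illustration, and splitting on white space is a simplifying assumption.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each whitespace-separated, lowercased token occurs."""
    return Counter(text.lower().split())

bow = bag_of_words("free prize call now to claim your free prize")
print(bow["free"], bow["call"])
```

Word order is discarded; only the counts survive, and those counts are what a classifier later receives as features.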


#Term Document Matrix
The Term Document Matrix (TDM) is a matrix that contains the frequency of occurrence of terms in a collection of documents.
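In a TDM, the rows represent terms and the columns represent documents. Note that scikit-learn's CountVectorizer returns the transposed layout (a document-term matrix), with one row per document and one column per term.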
```
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
tf_vectorizer = CountVectorizer(ngram_range=(1, 2),min_df = (1/len(Training_label)), max_df = 0.7)
Total_Dictionary_TDM = tf_vectorizer.fit(Training_data)
message_data_TDM = Total_Dictionary_TDM.transform(Training_data)
```

![alt text](https://docs-cdn.fresco.me/system/attachments/files/000/167/821/large/2f3e54bd014f3a596c1cc0dae54240472dde9b59/Term_Document.jpeg)
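
The following sketch builds a small term document matrix by hand, with terms as rows and documents as columns. The three sample documents are invented, and this is only an illustration of the layout, not a substitute for CountVectorizer.

```python
from collections import Counter

documents = [
    "free prize call now",
    "call me later",
    "free free entry",
]

# One Counter of term frequencies per document.
counts = [Counter(doc.split()) for doc in documents]

# A sorted vocabulary gives a stable row order.
vocabulary = sorted(set(term for c in counts for term in c))

# Rows are terms, columns are documents.
tdm = [[c[term] for c in counts] for term in vocabulary]

for term, row in zip(vocabulary, tdm):
    print(f"{term:<6} {row}")
```

Most entries are zero even in this tiny example, which is why real implementations store the matrix in a sparse format.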

#Term Frequency Inverse Document Frequency (TFIDF)
In a Term Frequency Inverse Document Frequency (TFIDF) matrix, the term importance is expressed by Inverse Document Frequency (IDF).

IDF diminishes the weight of the most commonly occurring words and increases the weightage of rare words.
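
A small worked sketch makes the weighting concrete. It assumes the textbook formula idf = log(N / df), where N is the number of documents and df is the number of documents containing the term; scikit-learn's TfidfVectorizer uses a smoothed and normalized variant by default, so its numbers will differ. The sample documents and the names `doc_freq` and `tfidf` are invented for illustration.

```python
import math
from collections import Counter

documents = [
    "free prize call now",
    "call me later",
    "free free entry",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_tokens):
    """Weight each raw term count by idf = log(N / df)."""
    tf = Counter(doc_tokens)
    return {term: count * math.log(n_docs / doc_freq[term]) for term, count in tf.items()}

weights = [tfidf(doc) for doc in tokenized]
print(weights[0])
```

In the first document, "prize" appears in only one document and therefore gets a higher weight than "free", which appears in two.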
```
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2),min_df = (1/len(Training_label)), max_df = 0.7)
Total_Dictionary_TFIDF = tfidf_vectorizer.fit(Training_data)
message_data_TFIDF = Total_Dictionary_TFIDF.transform(Training_data)
```

Let's take the TDM matrix for further evaluation. You can also try out the same using the TFIDF matrix.



#How Does a Classifier Work?
The following are the steps involved in building a classification model:

Initialize the classifier to be used.

Train the classifier - All classifiers in scikit-learn use a fit(X, y) method to fit (train) the model on the training data X and training labels y.

Predict the target - Given unlabeled observations X, predict(X) returns the predicted labels y.

Evaluate the classifier model - score(X, y) returns the mean accuracy on the given test data X and test labels y.
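
The four steps above can be sketched with a toy classifier that mimics the fit/predict/score interface. `MajorityClassifier` is a hypothetical class written for this illustration (it is not part of scikit-learn), and the "ham"/"spam" labels and feature values are made up.

```python
from collections import Counter

class MajorityClassifier:
    """Toy classifier that always predicts the most frequent training label."""

    def fit(self, X, y):
        # "Training" here just records the most common label in y.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]

    def score(self, X, y):
        # Fraction of predictions that match the true labels.
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)

X_train = [[0], [1], [2], [3]]
y_train = ["ham", "ham", "ham", "spam"]
X_test = [[4], [5]]
y_test = ["ham", "spam"]

clf = MajorityClassifier()            # 1. initialize
clf.fit(X_train, y_train)             # 2. train
predicted = clf.predict(X_test)       # 3. predict
accuracy = clf.score(X_test, y_test)  # 4. evaluate
print(predicted, accuracy)
```

Any real scikit-learn classifier can be dropped into the same four calls, which is what makes it easy to swap algorithms.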

#Train and Test Data
The code snippet provided here is for partitioning the data into train and test for building the classifier model. This split will be used to explain classification algorithms.
```
# Splitting the data for training and testing
from sklearn.model_selection import train_test_split
train_data, test_data, train_label, test_label = train_test_split(message_data_TDM, Training_label, test_size=.1)
```

#Decision Tree Classification
It is one of the commonly used classification techniques for performing binary as well as multi-class classification.

The decision tree model predicts the class/target by learning simple decision rules from the features of the data.

```
from sklearn.tree import DecisionTreeClassifier

# Creating a decision tree classifier model
classifier = DecisionTreeClassifier()
# Model training
classifier = classifier.fit(train_data, train_label)
# After being fitted, the model can then be used to predict the output.
message_predicted_target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print('Decision Tree Classifier : ', score)
```

#Stochastic Gradient Descent Classifier
It is used for large-scale learning.

It supports different loss functions and penalties for classification.
```
seed=7
from sklearn.linear_model import SGDClassifier
classifier =  SGDClassifier(loss='modified_huber', shuffle=True,random_state=seed)
classifier = classifier.fit(train_data, train_label)
score = classifier.score(test_data, test_label)
print('SGD classifier : ',score)
```

#Support Vector Machine
Support Vector Machine(SVM) is effective in high-dimensional spaces.

It is effective in cases where the number of dimensions is greater than the number of samples.

It works well with a clear margin of separation.
```
from sklearn.svm import SVC
classifier = SVC(kernel="linear", C=0.025,random_state=seed)
classifier = classifier.fit(train_data, train_label)
score = classifier.score(test_data, test_label)
print('SVM Classifier : ',score)
```

#Random Forest Classifier
A random forest helps control overfitting.

It fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy.
```
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10,random_state=seed)
classifier = classifier.fit(train_data, train_label)
score = classifier.score(test_data, test_label)
print('Random Forest Classifier : ',score)
```

#Model Tuning
The classification algorithms in machine learning are parameterized. Modifying any of those parameters can influence the results. So algorithm/model tuning is essential to find out the best model.

For example, let's take the Random Forest Classifier and change the values of a few parameters (n_estimators, max_features).

```
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=5, n_estimators=15, max_features=60,random_state=seed)
classifier = classifier.fit(train_data, train_label)
score=classifier.score(test_data, test_label)
print('Random Forest classification after model tuning',score)
```

Refer to the scikit-learn tutorials, try changing the parameters of other classifiers, and analyze the results.



#Partitioning the Data
It is a methodological mistake to train and test on the same dataset. A classifier evaluated this way can score well simply by memorizing the training samples (overfitting) and still fail to predict correctly on unseen data.

To avoid this problem, split the data into a training set, a validation set and a test set.

Training Set: The data used to train the classifier.

Validation Set: The data used to tune the classifier model parameters i.e., to understand how well the model has been trained (a part of training data).

Testing Set: The data used to evaluate the performance of the classifier (unseen data by the classifier).

This will help you know the efficiency of your model.
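
A three-way partition can be sketched with the standard library as follows. The function name `three_way_split`, the 60/20/20 proportions and the seed are assumptions chosen for illustration; in practice the same effect is often obtained by calling a library splitter twice.

```python
import random

def three_way_split(items, val_fraction=0.2, test_fraction=0.2, seed=7):
    """Shuffle a copy of items and cut it into train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test

train, validation, test = three_way_split(range(10))
print(len(train), len(validation), len(test))
```

Every sample lands in exactly one of the three sets, so the test set stays unseen while parameters are tuned on the validation set.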

#Cross Validation
Cross validation is a model validation technique to evaluate the performance of a model on unseen data (validation set).

It gives a better estimate of how the model will perform on unseen data than the training accuracy does.

Points to remember:

Cross validation gives high variance if the testing set and the training set are not drawn from the same population.

Allowing training data to be included in testing data will not give actual performance results.

In cross validation, the number of samples used for training the model is reduced and the results depend on the choice of the pair of training and testing sets.

You can refer to the various CV approaches here.
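
One common approach, k-fold cross validation, can be sketched as an index generator: the samples are cut into k folds and each fold serves as the test set once. The function name `k_fold_indices` is hypothetical, and the sketch assumes contiguous folds without shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=7, k=3))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

Each sample is used for testing exactly once and never appears in the training and testing indices of the same fold.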

#Stratified Shuffle Split
StratifiedShuffleSplit splits the data randomly while preserving the percentage of samples from each class in both the train and test sets.

StratifiedShuffleSplit suits our case study because the dataset has a class imbalance (visible in the groupby('label') output printed earlier). The following code snippet sets up the splitter:
```
seed=7
from sklearn.model_selection import StratifiedShuffleSplit
###cross validation with 10% sample size
sss = StratifiedShuffleSplit(n_splits=1,test_size=0.1, random_state=seed)
sss.get_n_splits(message_data_TDM,Training_label)
print(sss)
```

Here, test_size=0.1 denotes that 10% of the dataset is used for testing.
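
To illustrate what stratification means, here is a minimal standard-library sketch that splits each class separately so that the test set keeps the class mix of the full data. The function name `stratified_split` and the 80/20 "ham"/"spam" labels are invented for illustration, and the rounding rule is an assumption; this is not how scikit-learn implements StratifiedShuffleSplit.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=7):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class (at least one sample).
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["ham"] * 80 + ["spam"] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.1)
n_spam_test = sum(labels[i] == "spam" for i in test_idx)
print(n_spam_test, "spam out of", len(test_idx), "test samples")
```

The test set keeps the 80/20 ratio of the full data, whereas a plain random split of an imbalanced dataset could end up with very few minority-class samples.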

#Stratified Shuffle Split Contd...
This selection is then used to split the data into test and train sets.
```
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn import svm

# Candidate classifiers to compare
classifiers = [
    DecisionTreeClassifier(),
    SGDClassifier(loss='modified_huber', shuffle=True),
    SVC(kernel="linear", C=0.025),
    KNeighborsClassifier(),
    OneVsRestClassifier(svm.LinearSVC()),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10),
]
for clf in classifiers:
    score = 0
    for train_index, test_index in sss.split(message_data_TDM, Training_label):
        X_train, X_test = message_data_TDM[train_index], message_data_TDM[test_index]
        # .iloc selects rows by position, matching the positional indices from split()
        y_train, y_test = Training_label.iloc[train_index], Training_label.iloc[test_index]
        clf.fit(X_train, y_train)
        score = score + clf.score(X_test, y_test)
    # With n_splits=1 this is the score of the single split
    print(type(clf).__name__, score)
```
The above code evaluates a collection of classifiers on the same cross validation split. It helps to select the best classifier based on the cross validation scores. The classifier with the highest score can be used for building the classification model.

Note: You may add or remove classifiers based on the requirement.

#Classification Accuracy
The classification accuracy is defined as the percentage of correct predictions.

```
from sklearn.metrics import accuracy_score
print('Accuracy Score',accuracy_score(test_label,message_predicted_target))  
classifier = classifier.fit(train_data, train_label)
score=classifier.score(test_data, test_label)
print(test_label.value_counts())
```

Simple classification accuracy does not tell us what types of errors the classifier makes.

It is easy to compute, but it does not reveal the underlying distribution of the response values.
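
A small sketch shows why accuracy alone can mislead on imbalanced data. The labels are invented, and `always_ham` stands for a hypothetical model that never predicts the minority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = ["ham"] * 9 + ["spam"]
always_ham = ["ham"] * 10  # a model that never predicts "spam"

acc = accuracy(y_true, always_ham)
print(acc)
```

The score is 0.9 even though the model misses every "spam" message, so the number says nothing about which kind of error was made.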

#Confusion Matrix
It is a technique to evaluate the performance of a classifier.

It depicts the performance in a tabular form that has 2 dimensions namely “actual” and “predicted” sets of data.

The cells of the table show the counts of true positives, false positives, false negatives and true negatives.

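```
from sklearn.metrics import confusion_matrix
print('Confusion Matrix', confusion_matrix(test_label, message_predicted_target))
```

The first argument holds the true labels and the second argument holds the predicted labels. In the resulting matrix, rows correspond to the actual classes and columns to the predicted classes.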
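
The four cells can also be counted by hand for a binary problem, as in the sketch below. The function name `confusion_counts` and the sample labels are made up for illustration, and "spam" is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN and TN, treating `positive` as the positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "ham", "spam"]
counts = confusion_counts(y_true, y_pred, positive="spam")
print(counts)
```

Unlike a single accuracy figure, these counts separate the two kinds of mistakes: a "ham" message flagged as "spam" (false positive) and a "spam" message let through (false negative).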