Word Embeddings - We'll use TF-IDF Vector


- a. __Special character cleaning__: special characters such as “\n” and double quotes must be removed from the text, since we don't expect them to carry any predictive power.
- b. __Upcase/downcase__: we would expect, for example, “Book” and “book” to be the same word and have the same predictive power. For that reason we have lowercased every word.
- c. __Punctuation signs__: characters such as “?”, “!” and “;” have been removed.
- d. __Possessive pronouns__: in addition, we would expect “Trump” and “Trump’s” to have the same predictive power, so the possessive ending “’s” is removed.
- e. __Stemming or Lemmatization__: stemming is the process of reducing derived words to their root. Lemmatization is the process of reducing a word to its lemma. The main difference between the two methods is that lemmatization always yields existing words, whereas stemming yields a root that may not be an existing word. We have used a lemmatizer based on WordNet.
- f. __Stop words__: words such as “what” or “the” won’t have any predictive power, since they are presumably common to all documents. They are therefore noise that can be eliminated. We have downloaded a list of English stop words from the nltk package and removed them from the corpus.
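The steps above can be sketched with the standard library alone. This is an illustration under assumptions, not the notebook's actual code: the function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the text uses NLTK's English stop-word list), and step e is left out because NLTK's `WordNetLemmatizer` needs a corpus download; it would run on each token just before the stop-word filter.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the text uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "what", "is", "of", "and", "to", "in"}

def preprocess(text, stop_words=STOP_WORDS):
    """Apply steps a, b, c, d and f to one document and return its tokens."""
    # a. Replace line breaks with spaces and drop straight/curly double quotes
    text = re.sub(r"[\r\n]+", " ", text)
    text = re.sub(r'["“”]', "", text)
    # b. Lowercase everything
    text = text.lower()
    # d. Strip the possessive ending before punctuation is removed ("trump's" -> "trump")
    text = re.sub(r"['’]s\b", "", text)
    # c. Remove the remaining punctuation (anything that is not a word character or whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    # f. Split on whitespace and drop stop words
    return [tok for tok in text.split() if tok not in stop_words]

print(preprocess('What is "Trump\'s" Book?\nThe book!'))
```

The possessive is handled before the punctuation step on purpose: once apostrophes are stripped, “Trump’s” would collapse to “trumps” and no longer match “trump”.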


In [7]:
import os
import pandas as pd
import numpy as np
import glob
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [8]:
# Return the path of the first CSV file found in the data directory
def loading_file():
    file_dir = '/home/nbuser/library/1. Classifier/3. Exploratory Data Analysis'
    file_list = glob.glob(file_dir + '/*.csv')
    csv_file = file_list[0]
    return csv_file

# Import file into a pandas DataFrame
def importing_file(csv_file):
    df = pd.read_csv(csv_file, sep=",")
    return df

# Save a DataFrame as a CSV file named file_name inside save_dir
def saving_file(file, file_name, save_dir):
    file.to_csv(os.path.join(save_dir, file_name))


### Importing file

In [9]:
# Locate the CSV file and load it into a DataFrame
news_df = importing_file(loading_file())

# Top 5 records
news_df.head()

| Unnamed: 0 | file_name | title | news_text | category |
|---|---|---|---|---|
| 0 | 348.txt | Berlin celebrates European cinema | Organisers say this year's Berlin Film Festiva... | entertainment |
| 1 | 139.txt | U2 to play at Grammy awards show | Irish rock band U2 are to play live at the Gra... | entertainment |
| 2 | 125.txt | Snow Patrol feted at Irish awards | Snow Patrol were the big winners in Ireland's ... | entertainment |
| 3 | 267.txt | T in the Park sells out in days | Tickets for Scotland's biggest music festival ... | entertainment |
| 4 | 311.txt | Corbett attacks 'dumbed-down TV' | Ronnie Corbett has joined fellow comedy stars ... | entertainment |


In [10]:
# Total number of space-separated words across all articles
news_df['news_text'].apply(lambda x: len(x.split(' '))).sum()

766239

### NLP Pipeline
#### 1 Word Tokenization, 2 Part of Speech, 3 Lemmatization, 4 Stemming, 5 Stop words, 6 Sentence Tokenization

#### Text Classification With Scikit-Learn

https://e-string.com/articles/text-classification-with-sklearn/

### Clean the text column

In [24]:
import re
def clean_text(x):
    # Keep only ASCII letters, digits and whitespace
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', x)
    return text
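One detail of the character class is worth checking: the range `A-Z` covers only uppercase letters, whereas `A-z` spans ASCII 65 to 122 and so also lets `[`, `\`, `]`, `^`, `_` and the backtick through. A minimal sketch of the difference; both function names are hypothetical and exist only for this comparison:

```python
import re

def clean_text_strict(text):
    # Keep only ASCII letters, digits and whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

def clean_text_loose(text):
    # The range A-z also covers [ \ ] ^ _ and the backtick, so those characters survive
    return re.sub(r'[^a-zA-z0-9\s]', '', text)

sample = "price_[2005] rose 5%!"
print(clean_text_strict(sample))  # price2005 rose 5
print(clean_text_loose(sample))   # price_[2005] rose 5
```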

In [29]:
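def clean_all_text(df, column):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Apply clean_text to every value of the column
    df[column] = df[column].apply(clean_text)
    return df


In [31]:
# Select with a list of columns so the result is a DataFrame, not a Series
test_text = news_df.loc[:13, ['news_text']]

test_clean = clean_all_text(test_text, 'news_text')
print(test_clean)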

### Bag of words, n-grams, tf-idf
https://github.com/RaRe-Technologies/movie-plots-by-genre/blob/master/Document%20classification%20with%20word%20embeddings%20tutorial.ipynb
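As a minimal, dependency-free sketch of the three representations named in the heading: a bag of words is a per-document term count, n-grams are runs of n consecutive tokens, and TF-IDF scales each count by how rare the term is across documents. The function names `ngrams` and `tfidf` are hypothetical, and the weighting (raw count times ln(N / df)) is an assumption chosen for simplicity. scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each vector by default, so its numbers will differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: tf = raw count, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # bag of words for this document
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return weights

docs = [["film", "festival", "film"], ["music", "festival"]]
print(ngrams(docs[0], 2))
print(tfidf(docs)[0])
```

With this weighting, a term that occurs in every document (here “festival”) gets weight 0, which is the intended effect: it cannot help tell categories apart.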

#### Implementing multi-class text classification with Doc2Vec
https://towardsdatascience.com/implementing-multi-class-text-classification-with-doc2vec-df7c3812824d

#### Multi-Class text classification using XGBoost & others, Doc2Vec, TfIdf

https://github.com/avisheknag17/public_ml_models/blob/master/bbc_articles_text_classification/notebook/text_classification_xgboost_others.ipynb

#### Gensim NLTK
https://github.com/DistrictDataLabs/PyCon2016/tree/master/notebooks/tutorial

In [11]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/nbuser/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [21]:
# Tokenize with the NLTK tokenizer, lowercase every token and drop one-character tokens.
# Stop-word removal and limiting the vocabulary to the 3k most frequent words are not done in this function.
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens
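The function above only tokenizes, lowercases and drops one-character tokens. The stop-word removal and the 3k-word vocabulary limit mentioned in its comment could be handled in a separate step, sketched below. The names `build_vocabulary` and `restrict_to_vocabulary` and the small `STOP_WORDS` set are hypothetical; in the notebook the stop words would come from NLTK's English list.

```python
from collections import Counter

# Hypothetical stand-in for NLTK's English stop-word list
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is"}

def build_vocabulary(token_lists, max_size=3000, stop_words=STOP_WORDS):
    """Return the max_size most frequent non-stop-word tokens across all documents."""
    counts = Counter(
        tok for tokens in token_lists for tok in tokens if tok not in stop_words
    )
    return [tok for tok, _ in counts.most_common(max_size)]

def restrict_to_vocabulary(tokens, vocabulary):
    """Keep only the tokens that belong to the vocabulary."""
    vocab = set(vocabulary)
    return [tok for tok in tokens if tok in vocab]

docs = [["the", "band", "is", "on", "tour"],
        ["the", "tour", "of", "the", "band", "band"]]
vocab = build_vocabulary(docs, max_size=2)
print(vocab)                                   # ['band', 'tour']
print(restrict_to_vocabulary(docs[0], vocab))  # ['band', 'tour']
```

Building the vocabulary once over the whole corpus, then filtering each document against it, keeps the feature space identical for every article.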

In [17]:
# Import necessary modules
from nltk.tokenize import sent_tokenize, word_tokenize

for raw in news_df.loc[:13, 'news_text']:
    text = raw
    # Split the article into sentences
    sentences = sent_tokenize(text)

    # Tokenize the fourth sentence, if the article has at least four sentences
    tokenized_sent = word_tokenize(sentences[3]) if len(sentences) > 3 else []

    # Build the set of unique tokens in the whole article
    unique_tokens = set(word_tokenize(text))

# TODO: store the tokens of each article in a new column of news_df
# After the loop, unique_tokens holds only the tokens of the last article processed
print(unique_tokens)


{'Wham', 'common', 'listed', 'Know', 'na', 'that', 'Moped', '?', 'Surprisingly', 'says', 'reference', 'chart', '8,000', 'commissioned', '``', 'There', 'no', 'This', '75', 'for', 'said', 'Sir', 'earth', 'at', 'rightful', 'into', 'formulaic', 'all', 'our', 'by', 'on', 'should', 'track', 'formula', 'be', 'bells', 'The', 'a', 'Band', 'number', 'Everybody', 'single', 'sleigh', 'so', 'chance', '.', 'top', 'British', 'times', 'Roberts', 'Wine', 'peace', 'book', 'song', 'remake', 'Last', 'recipe', 'this', 'musical', 'is', 'whole', 'linking', 'Aid', 'They', 'revealed', 'Merry', 'set', "''", ',', 'has', 'Number', 'recent', 'elements', 'recording', 'Mr', 'group', 'wishes', 'combine', 'Sunday', 'also', 'Santa', 'office', 'big', 'Cliff', 'Vs', 'title', '20', 'Slade', 'Mistletoe', 'Singles', '-', 'ultimate', 'Christmas', 'years', 'A', 'but', 'place', 'It', 'prank', 'Do', 'as', 'the', 'help', 'Have', 'called', 'Gon', 'Father', 'to', 'charity', 'one', 'lots', 'festive', 'there', 'parties', 'in', 'firs

In [32]:
def print_plot(index):
    # Select the rows whose index label matches; the result is empty if there is none
    example = news_df[news_df.index == index][['news_text', 'category']].values
    if len(example) > 0:
        print(example[0][0])
        print('Category:', example[0][1])

In [33]:
print_plot(13)

US rap star 50 Cent has said he has thrown protege The Game out of his G-Unit gang in a feud that has apparently involved two shootings.In a radio interview on Monday, 50 Cent said the newcomer was disloyal in conflicts with other rappers. A man was shot in the thigh outside New York's Hot 97 studios while 50 Cent was on air. More shots were fired outside his management offices two hours later. 50 Cent appeared on The Game's debut album, which was number one in the US. 50 Cent, whose second album is about to be released after his debut made him one of hip-hop's biggest stars, has been involved in recent rivalries with fellow artists including Fat Joe, Nas and Jadakiss.He has claimed credit for the success of The Game, who has become the hottest new star on the rap scene. Both were drug dealers and were shot before turning to music.In an interview with Hot 97 on Saturday, The Game described some of 50 Cent's rivals as "my friends" and said he would not turn on them. "Nas is one of my fr

In [None]:
print_plot(209)