## Lab NLP


# Challenge 1 - Installations

In [1]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Notes - Video 1
#### Natural Language Processing: Crash Course Computer Science

**Natural Languages:** 
- Presence of linguistic faux pas: slurring words together, mispronunciation, ambiguous phrases that can only be distinguished by context, and other factors that add complexity to the data;
- Humans can understand this complexity, but it is a great challenge to have computers understand and speak through natural language.  

**Natural Language Processing (NLP):**
- An interdisciplinary field that combines computer science and linguistics
- There are infinitely many ways to combine words in a sentence, so it is impossible to give a computer a dictionary of all possible sentences to train on.
- Instead, sentences are deconstructed into bite-sized pieces that can be processed more easily.

*Parts of speech*: 
- 9 main types in English: Conjunctions, verbs, adjectives, nouns, interjections, pronouns, adverbs, articles, prepositions;
- Divided into subcategories (for example: singular vs. plural, superlative vs. comparative adverbs, etc.).
- Problem: some words have multiple meanings (the same word may be a noun or a verb, for example). This creates ambiguity, so we also need to teach computers grammar.

*Phrase Structure Rules*:
- Also vary from language to language
- Some of the possibilities for English:
    - Sentence = Noun Phrase + Verb Phrase
    - Noun phrase = Article + Noun 
    - Noun phrase = Adjective + Noun
    - Noun phrase = Noun
    - Verb phrase = verb
    - Verb phrase = verb + noun phrase
    - Verb phrase = verb + prepositional phrase
    - Verb phrase = verb + noun phrase + prepositional phrase
    - Prepositional phrase = Preposition + Noun Phrase
    - etc
- Using these rules it becomes easier to construct a PARSE TREE.

*Parse Tree*:
- Tags every word with a likely part of speech
- Also reveals how the sentence is constructed
- Branches start from the classification of individual words (parts of speech), join at higher levels to form phrase structures, and the phrase structures are then joined into a sentence.
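
The phrase structure rules and the parse tree described above can be sketched with NLTK's context-free grammar tools. This is only a toy illustration: the grammar mirrors the rules listed in these notes, while the vocabulary and the example sentence are made up for the demo.

```python
import nltk

# Toy grammar built from the phrase structure rules in the notes.
# The word lists (Art, Adj, N, V, P) are hypothetical examples.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Art N | Adj N | N
    VP -> V | V NP | V PP | V NP PP
    PP -> P NP
    Art -> 'the' | 'a'
    Adj -> 'big'
    N -> 'dog' | 'ball' | 'park'
    V -> 'chased' | 'slept'
    P -> 'in'
""")

parser = nltk.ChartParser(grammar)
tokens = "the dog chased a ball in the park".split()

# The parser yields every tree the grammar allows for these tokens.
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

Each printed tree tags every word with a part of speech at the leaves and groups the words into noun, verb and prepositional phrases on the way up to the sentence node.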


Computers perform well with direct sentences or commands, since they use this parsing structure to decompose and interpret each sentence, but they fail more often as sentences become more complex.

*Knowledge Graph*:
- Facts and relationships between entities that are used to "feed" algorithms in order to form sentences and reproduce natural speech.
- Parsing and generating text are two fundamental components of natural language chatbots.
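
As a rough illustration of how stored facts can be turned into sentences, the sketch below keeps a few (subject, relation, object) triples and fills in one sentence template per relation. The facts, relation names and templates are all invented for this example; real knowledge graphs are far larger and are queried with dedicated tools.

```python
# Hypothetical facts stored as (subject, relation, object) triples.
triples = [
    ("Lisbon", "capital_of", "Portugal"),
    ("Madrid", "capital_of", "Spain"),
    ("Portugal", "borders", "Spain"),
]

# One sentence template per relation type.
templates = {
    "capital_of": "{s} is the capital of {o}.",
    "borders": "{s} shares a border with {o}.",
}

def facts_about(entity):
    """Return one generated sentence per triple that mentions the entity."""
    return [
        templates[rel].format(s=subj, o=obj)
        for subj, rel, obj in triples
        if entity in (subj, obj)
    ]

print(facts_about("Portugal"))
```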

*Chatbots*:
- Early chatbots were rule-based - hundreds of encoded rules mapping what a user might say to how a program should reply.
- This method is limiting and expensive to maintain
- More recent methods are based on machine learning models that are trained on real human-to-human conversations.
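
A rule-based chatbot of the early kind can be sketched in a few lines: each rule pairs a pattern for what the user might say with a canned reply. The three rules and the fallback message below are hypothetical; the point is only to show why this approach needs a new rule for every new kind of input.

```python
import re

# Hypothetical rules: each pattern maps what a user might say to a reply.
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.I), "Hello! How can I help you?"),
    (re.compile(r"\bmy name is (\w+)", re.I), "Nice to meet you, {0}."),
    (re.compile(r"\b(bye|goodbye)\b", re.I), "Goodbye!"),
]
FALLBACK = "Sorry, I did not understand that."

def reply(message):
    """Return the reply of the first rule whose pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(message)
        if match:
            # Captured groups (such as the name) fill the template slots.
            return template.format(*match.groups())
    return FALLBACK

print(reply("my name is Ana"))
print(reply("What is the weather?"))
```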

*Speech Recognition Systems*:
- Today, the best speech recognition systems use deep neural networks
- Soundwaves are converted into frequencies by an algorithm called the Fast Fourier Transform
- Each phoneme has its own frequency representation, so once sound is captured, a speech recognition system converts these sound frequencies into words, and from there it is an NLP problem.
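
To make the soundwave-to-frequency step concrete, here is a minimal pure-Python sketch of a recursive radix-2 Fast Fourier Transform applied to a synthetic signal. The 5 Hz sine wave and the 64-sample rate are assumptions chosen for the demo; real speech systems work on recorded audio, use optimized FFT libraries, and analyze many short windows rather than one.

```python
import cmath
import math

def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return list(samples)
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

# Synthetic "sound": a pure 5 Hz sine wave sampled 64 times over one second.
sample_rate = 64
signal = [math.sin(2 * math.pi * 5 * t / sample_rate) for t in range(sample_rate)]

spectrum = fft(signal)
# For a real-valued signal only the first half of the bins is informative.
magnitudes = [abs(c) for c in spectrum[: sample_rate // 2]]
# With one second of audio, bin k corresponds to k Hz.
dominant_hz = max(range(len(magnitudes)), key=magnitudes.__getitem__)
print(dominant_hz)
```

The strongest bin lines up with the frequency of the input wave, which is the kind of frequency signature a recognizer would then try to match to phonemes.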

In [2]:
from nltk.corpus import brown
from nltk import sent_tokenize, word_tokenize

brown.words()[0:10]
brown.tagged_words()[0:10]

text = 'Ironhack is a Global Tech School ranked num 2 worldwide.   Our mission is to help people transform their careers and join a thriving community of tech professionals that love what they do. This ideology is reflected in our teaching practices, which consist of a nine-weeks immersive programming, UX/UI design or Data Analytics course as well as a one-week hiring fair aimed at helping our students change their career and get a job straight after the course. We are present in 8 countries and have campuses in 9 locations - Madrid, Barcelona, Miami, Paris, Mexico City,  Berlin, Amsterdam, Sao Paulo and Lisbon.'

sent_tokenize(text)

['Ironhack is a Global Tech School ranked num 2 worldwide.',
 'Our mission is to help people transform their careers and join a thriving community of tech professionals that love what they do.',
 'This ideology is reflected in our teaching practices, which consist of a nine-weeks immersive programming, UX/UI design or Data Analytics course as well as a one-week hiring fair aimed at helping our students change their career and get a job straight after the course.',
 'We are present in 8 countries and have campuses in 9 locations - Madrid, Barcelona, Miami, Paris, Mexico City,  Berlin, Amsterdam, Sao Paulo and Lisbon.']

In [3]:
word_tokenize(text)

['Ironhack',
 'is',
 'a',
 'Global',
 'Tech',
 'School',
 'ranked',
 'num',
 '2',
 'worldwide',
 '.',
 'Our',
 'mission',
 'is',
 'to',
 'help',
 'people',
 'transform',
 'their',
 'careers',
 'and',
 'join',
 'a',
 'thriving',
 'community',
 'of',
 'tech',
 'professionals',
 'that',
 'love',
 'what',
 'they',
 'do',
 '.',
 'This',
 'ideology',
 'is',
 'reflected',
 'in',
 'our',
 'teaching',
 'practices',
 ',',
 'which',
 'consist',
 'of',
 'a',
 'nine-weeks',
 'immersive',
 'programming',
 ',',
 'UX/UI',
 'design',
 'or',
 'Data',
 'Analytics',
 'course',
 'as',
 'well',
 'as',
 'a',
 'one-week',
 'hiring',
 'fair',
 'aimed',
 'at',
 'helping',
 'our',
 'students',
 'change',
 'their',
 'career',
 'and',
 'get',
 'a',
 'job',
 'straight',
 'after',
 'the',
 'course',
 '.',
 'We',
 'are',
 'present',
 'in',
 '8',
 'countries',
 'and',
 'have',
 'campuses',
 'in',
 '9',
 'locations',
 '-',
 'Madrid',
 ',',
 'Barcelona',
 ',',
 'Miami',
 ',',
 'Paris',
 ',',
 'Mexico',
 'City',
 ',',
 '

## Challenge 2 - Preparing Text Data For Analysis

In [1]:
import re

def cleanup(s):
    """
    Cleans up numbers, URLs, and special characters from a string.

    Args:
        s: The string to be cleaned up.

    Returns:
        A string that has been cleaned up.
    """
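    # Pass flags by keyword: the fourth positional argument of re.sub is count.
    s = re.sub(r'https?://\S+', ' ', s, flags=re.I)
    # Replace punctuation and special characters with spaces.
    s = re.sub(r'[\.\,\!\?\"\'\¡\¿\:\#\@\-\)\(]', ' ', s)
    # Replace each run of digits with a single space.
    s = re.sub(r'[0-9]+', ' ', s)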
    return s

## Tokenization

In [2]:
from nltk.tokenize import word_tokenize

frase = 'ironhack s  q website  is'

def tokenize(s):
    """
    Tokenize a string.

    Args:
        s: String to be tokenized.

    Returns:
        A list of words as the result of tokenization."""
    return word_tokenize(s)

tokenize(frase)

['ironhack', 's', 'q', 'website', 'is']

## Stemming and Lemmatization

In NLTK, there are three stemmers: Porter, Snowball, and Lancaster. The difference among the three is the aggressiveness with which they perform stemming. Porter is the gentlest stemmer and preserves the word's original form when in doubt. In contrast, Lancaster is the most aggressive and sometimes produces wrong outputs. Snowball is in between. **In most cases you will use either Porter or Snowball**.
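
One way to get a feel for the difference is to run the same words through all three stemmers and compare the results. The word list below is arbitrary and chosen only for illustration; the outputs depend on the NLTK version, so they are printed rather than listed here.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Arbitrary sample words chosen for illustration.
words = ["running", "generously", "organization", "maximum"]

for word in words:
    # Stem the same word with each algorithm to compare them side by side.
    stems = {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    print(word, stems)
```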


In [3]:
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

def stem_and_lemmatize(l):
    """
    Perform stemming and lemmatization on a list of words.

    Args:
        l: A list of strings.

    Returns:
        A list of strings after being stemmed and lemmatized.
    """
    st = PorterStemmer()
    lm = WordNetLemmatizer()
    # Stem each word, then lemmatize the stem, so the result has one entry
    # per input word (concatenating two lists would double its length).
    return [lm.lemmatize(st.stem(a)) for a in l]


## Stop Words Removal

Stop words are the most commonly used words in a language that don't contribute to the main meaning of a text. Examples of English stop words are i, me, is, and, the, but, and here. We want to remove stop words from the analysis because otherwise they will make up an overwhelming portion of our tokenized word list, and the NLP algorithms will have trouble identifying the truly important words.

NLTK has a stopwords package that allows us to import the most common stop words in over a dozen languages including English, Spanish, French, German, Dutch, Portuguese, Italian, etc. These are the bare minimum stop words (100-150 words in each language) that can get beginners started. Some other NLP packages such as stop-words and wordcloud provide bigger lists of stop words.

Now in your Jupyter Notebook, create a function called remove_stopwords that loops through a list of words that have been stemmed and lemmatized to check for and remove stop words. Return a new list where stop words have been removed.


In [5]:
from nltk.corpus import stopwords

def remove_stopwords(l):
    """
    Remove English stopwords from a list of strings.

    Args:
        l: A list of strings.

    Returns:
        A list of strings after stop words are removed.
    """
    stop = set(stopwords.words('english'))
    return [w for w in l if w not in stop]

## Challenge 3: Sentiment Analysis

In [6]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

txt = "Ironhack is a Global Tech School ranked num 2 worldwide.   Our mission is to help people transform their careers and join a thriving community of tech professionals that love what they do."
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores(txt)



{'neg': 0.0, 'neu': 0.741, 'pos': 0.259, 'compound': 0.8442}


## Creating Bag of Words

The purpose of this step is to create a bag of words from the processed data. The bag of words contains all the unique words in your whole text body (a.k.a. corpus) with the number of occurrences of each word. It will allow you to understand which words are the most important features across the whole corpus.

Also, you can imagine you will have a massive set of words. The less important words (i.e. those with very few occurrences) do not contribute much to the sentiment. Therefore, you only need to use the most important words to build your feature set in the next step. In our case, we will use the top 5,000 words with the highest frequency to build the features.

In your Jupyter Notebook, combine all the words in text_processed and calculate the frequency distribution of all words. A convenient tool for calculating the term frequency distribution is NLTK's FreqDist class. Then select the top 5,000 words from the frequency distribution.
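
A minimal sketch of this step, assuming text_processed holds one list of cleaned tokens per document. The three tiny documents below are a made-up stand-in for the real data, and the cutoff is lowered from 5,000 to 3 so the result is easy to read.

```python
from nltk import FreqDist

# Hypothetical stand-in for text_processed: one token list per document.
text_processed = [
    ["love", "new", "phone"],
    ["hate", "new", "update"],
    ["love", "love", "update"],
]

# Flatten all documents into one list of words and count occurrences.
all_words = [word for tokens in text_processed for word in tokens]
freq = FreqDist(all_words)

# Keep the N most frequent words (5,000 in the lab; 3 for this toy corpus).
top_n = 3
top_words = [word for word, count in freq.most_common(top_n)]
print(freq.most_common(top_n))
```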



## Testing Naïve Bayes Model

Now we'll test our classifier with the test dataset. This is done by calling nltk.classify.accuracy(classifier, test).

As mentioned in one of the tutorial videos, a Naive Bayes model is considered OK if your accuracy score is over 0.6. If your accuracy score is over 0.7, you've done a great job!
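
The accuracy call can be tried on toy data before running it on the real split. In the sketch below, the train and test lists are invented: each item is a (features, label) pair in which the features record whether a given word occurs in the document.

```python
import nltk

# Hypothetical feature sets: each item is (features_dict, label).
train = [
    ({"love": True, "hate": False}, "pos"),
    ({"love": True, "hate": False}, "pos"),
    ({"love": False, "hate": True}, "neg"),
    ({"love": False, "hate": True}, "neg"),
]
test = [
    ({"love": True, "hate": False}, "pos"),
    ({"love": False, "hate": True}, "neg"),
]

classifier = nltk.NaiveBayesClassifier.train(train)

# Fraction of test items whose predicted label matches the true label.
accuracy = nltk.classify.accuracy(classifier, test)
print(accuracy)
```

On such a small, perfectly separable toy set the score says nothing about real-world performance; it only shows the shape of the inputs the function expects.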



# Bonus Question 1 & 2: Improve Model Performance & Machine Learning Pipeline

If you still have energy and want to dig deeper, try to improve your classifier's performance. There are many aspects you can look into, for example:

- Improve stemming and lemmatization. Inspect your bag of words and the most important features. Are there any other words you should remove from the analysis? You can append these words to the stop words list.

- Remember we only used the top 5,000 features to build the model? Try using different numbers of top features. The bottom line is to use as few features as you can without compromising your model performance. The fewer features you select for your model, the faster it trains. Then you can use a larger sample size to improve your model accuracy score.

In a new Jupyter Notebook, combine all your code into a function (or a class). Your new function will execute the complete machine learning pipeline by receiving the dataset location and returning the classifier. **This will allow you to use your function to predict the sentiment of any tweet in real time**.
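
One possible shape for such a pipeline is sketched below, under several assumptions that are not part of the lab: the dataset is a CSV file with hypothetical `text` and `label` columns, tokenization is a plain whitespace split instead of word_tokenize, only stemming is applied, and the stop list is a tiny hard-coded set standing in for nltk.corpus.stopwords. All function names are illustrative.

```python
import csv
import re

import nltk
from nltk.stem import PorterStemmer

# Tiny illustrative stop list; the lab uses nltk.corpus.stopwords instead.
STOP_WORDS = {"a", "an", "and", "i", "is", "it", "the", "this", "to"}
STEMMER = PorterStemmer()

def preprocess(text):
    """Strip URLs and non-letters, split on whitespace, drop stop words, stem."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return [STEMMER.stem(w) for w in text.split() if w not in STOP_WORDS]

def featurize(tokens, vocabulary):
    """Map a token list to {word: is_present} over the chosen vocabulary."""
    present = set(tokens)
    return {word: word in present for word in vocabulary}

def build_classifier(csv_path, top_n=5000, text_col="text", label_col="label"):
    """Train a Naive Bayes classifier from a CSV file; return it with its vocabulary."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = [(preprocess(r[text_col]), r[label_col]) for r in csv.DictReader(handle)]
    # Bag of words over the whole corpus, truncated to the top_n words.
    freq = nltk.FreqDist(word for tokens, _ in rows for word in tokens)
    vocabulary = [word for word, _ in freq.most_common(top_n)]
    features = [(featurize(tokens, vocabulary), label) for tokens, label in rows]
    return nltk.NaiveBayesClassifier.train(features), vocabulary

def predict(classifier, vocabulary, text):
    """Classify a single raw text with a trained classifier."""
    return classifier.classify(featurize(preprocess(text), vocabulary))
```

Returning the vocabulary together with the classifier matters because a new tweet must be turned into features over the same words the model was trained on. Calling `build_classifier` with different `top_n` values is also a simple way to experiment with the number of features.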
