**Introduction to NLP feature engineering**
___
- concepts covered
    - text preprocessing
    - basic features
    - word features
    - vectorization
___

In [None]:
#One-hot encoding

#In the previous exercise, we encountered a dataframe df1 that
#contained categorical features and was therefore unsuitable as
#input to ML algorithms.

#In this exercise, your task is to convert df1 into a format that is
#suitable for machine learning.

# Print the features of df1
#print(df1.columns)
#################################################
#<script.py> output:
#    Index(['feature 1', 'feature 2', 'feature 3', 'feature 4', 'feature 5', 'label'], dtype='object')
#################################################

# Perform one-hot encoding
#df1 = pd.get_dummies(df1, columns=['feature 5'])

# Print the new features of df1
#print(df1.columns)

# Print first five rows of df1
#print(df1.head())

#################################################
#Index(['feature 1', 'feature 2', 'feature 3', 'feature 4', 'label', 'feature 5_female', 'feature 5_male'], dtype='object')
#       feature 1  feature 2  feature 3  feature 4  label  feature 5_female  feature 5_male
#    0    29.0000          0          0   211.3375      1                 1               0
#    1     0.9167          1          2   151.5500      1                 0               1
#    2     2.0000          1          2   151.5500      0                 1               0
#    3    30.0000          1          2   151.5500      0                 0               1
#    4    25.0000          1          2   151.5500      0                 1               0
#################################################
#You have successfully performed one-hot encoding on this dataframe.
#Notice how the feature 5 (which represents sex) gets converted to
#two features feature 5_male and feature 5_female. With one-hot
#encoding performed, df1 only contains numerical features and can
#now be fed into any standard ML model!
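
To make the transformation concrete, here is a minimal pure-Python sketch of what one-hot encoding does to a single categorical column. The `one_hot` helper and the sample values are hypothetical illustrations, not part of the exercise; `pd.get_dummies` performs the equivalent step on a whole dataframe.

```python
def one_hot(values, prefix):
    """Return one 0/1 column per distinct category found in `values`."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [int(value == category) for value in values]
        for category in categories
    }

# Hypothetical sample column standing in for 'feature 5'
sex = ["female", "male", "female", "male", "female"]
encoded = one_hot(sex, prefix="feature 5")
for name, column in encoded.items():
    print(name, column)
```

Each row has exactly one 1 across the new columns, which is where the name "one-hot" comes from.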

**Basic feature extraction**
___
- number of characters
- number of words
- average word length
- special features
    - e.g., number of hashtags in a tweet
- other features
    - number of sentences
    - number of paragraphs
    - words starting with an uppercase
    - all-capital words
    - numeric quantities
___
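
Several of the features listed above can be sketched with plain string methods. The `basic_features` helper and the sample sentence below are hypothetical; the sketch splits on whitespace, so punctuation stays attached to words and the counts are only approximate.

```python
def basic_features(text):
    """Compute a few simple summary features for a piece of text."""
    words = text.split()
    return {
        "char_count": len(text),
        "word_count": len(words),
        # Average word length; 0.0 for empty text to avoid dividing by zero
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Words whose first character is an uppercase letter
        "capitalized_words": sum(w[0].isupper() for w in words),
        # Words written entirely in capitals
        "all_caps_words": sum(w.isupper() for w in words),
        # Whitespace-separated tokens made up only of digits
        "numeric_tokens": sum(w.isdigit() for w in words),
    }

sample = "NLP is FUN and I wrote 3 scripts"
print(basic_features(sample))
```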

In [None]:
#Character count of Russian tweets

#In this exercise, you have been given a dataframe tweets which
#contains some tweets associated with Russia's Internet Research
#Agency and compiled by FiveThirtyEight.

#Your task is to create a new feature 'char_count' in tweets which
#holds the number of characters in each tweet. Also, compute the
#average tweet length. The tweets are available in the content
#feature of tweets.

# Create a feature char_count
#tweets['char_count'] = tweets['content'].apply(len)

# Print the average character count
#print(tweets['char_count'].mean())

#################################################
#<script.py> output:
#    103.462
#################################################
#Notice that the average character count of these tweets is
#approximately 103, which is much higher than the overall average
#tweet length of around 40 characters. Depending on what you're
#working on, this may be worth investigating. For your information,
#there is research that indicates that fake news articles tend to
#have longer titles! Therefore, even extremely basic features such
#as character counts can prove to be very useful in certain
#applications.

In [None]:
#Word count of TED talks

#ted is a dataframe that contains the transcripts of 500 TED talks.
#Your job is to compute a new feature word_count which contains the
#approximate number of words for each talk. Consequently, you also
#need to compute the average word count of the talks. The transcripts
#are available as the transcript feature in ted.

#In order to complete this task, you will need to define a function
#count_words that takes in a string as an argument and returns the
#number of words in the string. You will then need to apply this
#function to the transcript feature of ted to create the new feature
#word_count and compute its mean.

# Function that returns number of words in a string
#def count_words(string):
#    # Split the string into words
#    words = string.split()
#
#    # Return the number of words
#    return len(words)

# Create a new feature word_count
#ted['word_count'] = ted['transcript'].apply(count_words)

# Print the average word count of the talks
#print(ted['word_count'].mean())

#################################################
#<script.py> output:
#   1987.1
#################################################
#You now know how to compute the number of words in a given piece
#of text. Also, notice that the average length of a talk is close
#to 2000 words. You can use the word_count feature to compute its
#correlation with other variables such as number of views, number
#of comments, etc. and derive extremely interesting insights about
#TED.

In [None]:
#Hashtags and mentions in Russian tweets

#Let's revisit the tweets dataframe containing the Russian tweets.
#In this exercise, you will compute the number of hashtags and
#mentions in each tweet by defining two functions count_hashtags()
#and count_mentions() respectively and applying them to the content
#feature of tweets.

#In case you don't recall, the tweets are contained in the content
#feature of tweets.

# Function that returns number of hashtags in a string
#def count_hashtags(string):
#    # Split the string into words
#    words = string.split()
#
#    # Create a list of words that are hashtags
#    hashtags = [word for word in words if word.startswith('#')]
#
#    # Return number of hashtags
#    return len(hashtags)

# Create a feature hashtag_count and display distribution
#tweets['hashtag_count'] = tweets['content'].apply(count_hashtags)
#tweets['hashtag_count'].hist()
#plt.title('Hashtag count distribution')
#plt.show()

![_images/19.1.svg](_images/19.1.svg)

In [None]:
# Function that returns number of mentions in a string
#def count_mentions(string):
#    # Split the string into words
#    words = string.split()
#
#    # Create a list of words that are mentions
#    mentions = [word for word in words if word.startswith('@')]
#
#    # Return number of mentions
#    return len(mentions)

# Create a feature mention_count and display distribution
#tweets['mention_count'] = tweets['content'].apply(count_mentions)
#tweets['mention_count'].hist()
#plt.title('Mention count distribution')
#plt.show()

![_images/19.2.svg](_images/19.2.svg)
You now have a good grasp of how to compute various types of
summary features. In the next lesson, we will learn about more
advanced features that are capable of capturing more nuanced
information beyond simple word and character counts.
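
One limitation of the `split()` plus `startswith()` approach is that it misses hashtags and mentions glued to punctuation or to each other. The sketch below is a hypothetical regex-based alternative (the function names and the sample tweet are made up for illustration). It has its own blind spot: it would also count the `@example` in an email address as a mention.

```python
import re

# A '#' or '@' followed by one or more word characters
HASHTAG_PATTERN = re.compile(r"#\w+")
MENTION_PATTERN = re.compile(r"@\w+")

def count_hashtags_re(string):
    """Count hashtag-like patterns anywhere in the string."""
    return len(HASHTAG_PATTERN.findall(string))

def count_mentions_re(string):
    """Count mention-like patterns anywhere in the string."""
    return len(MENTION_PATTERN.findall(string))

tweet = "Thanks @alice and (@bob)! Loving #NLP,#python"
print(count_hashtags_re(tweet), count_mentions_re(tweet))
```

On this sample, the whitespace-based functions would find only one hashtag and one mention, because `(@bob)!` does not start with `@` and `#NLP,#python` is a single whitespace-separated word.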

**Readability tests**
___
- overview of readability tests
    - determine readability of an English passage
    - scale ranging from primary school up to college graduate level
    - a mathematical formula utilizing word, syllable, and sentence count
    - used in fake news and opinion spam detection
- readability test examples
    - **Flesch reading ease**
        - the greater the average sentence length, the harder the text is to read
        - the greater the average number of syllables per word, the harder the text is to read
        - the higher the score, the greater the readability
        ![_images/19.1.PNG](_images/19.1.PNG)
    - **Gunning fog index**
        - developed in 1952
        - also dependent on average sentence length
        - the greater the percentage of complex words, the harder the text is to read
        - the higher the index, the lower the readability
        ![_images/19.2.PNG](_images/19.2.PNG)
    - Simple Measure of Gobbledygook (SMOG)
    - Dale-Chall score
- the textatistic library
    - not available for Anaconda/Windows
    - requires pip install (5GB Visual C++ build tools requirement)
___
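
The two formulas referenced above can be written out directly once the word, sentence, and syllable counts are known. The sketch below uses the standard published forms of both formulas; the function names and the counts passed in are hypothetical, and a library such as textatistic may count syllables and sentences differently, so its scores need not match a hand calculation.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch reading ease: higher scores mean easier text."""
    avg_sentence_length = total_words / total_sentences
    avg_syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

def gunning_fog(total_words, total_sentences, complex_words):
    """Gunning fog index: higher values mean harder text.

    `complex_words` is the number of words with three or more syllables.
    """
    avg_sentence_length = total_words / total_sentences
    percent_complex = 100 * complex_words / total_words
    return 0.4 * (avg_sentence_length + percent_complex)

# Hypothetical counts for a 100-word, 5-sentence passage
print(round(flesch_reading_ease(100, 5, 150), 3))
print(round(gunning_fog(100, 5, 10), 1))
```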

In [1]:
#Readability of 'The Myth of Sisyphus'

#In this exercise, you will compute the Flesch reading ease score
#for Albert Camus' famous essay The Myth of Sisyphus. We will then
#interpret the value of this score as explained in the video and
#try to determine the reading level of the essay.

#The entire essay is in the form of a string and is available as
#sisyphus_essay.

sisyphus_essay = '\nThe gods had condemned Sisyphus to ceaselessly rolling a rock to the top of a mountain, whence the stone would fall back of its own weight. They had thought with some reason that there is no more dreadful punishment than futile and hopeless labor. If one believes Homer, Sisyphus was the wisest and most prudent of mortals. According to another tradition, however, he was disposed to practice the profession of highwayman. I see no contradiction in this. Opinions differ as to the reasons why he became the futile laborer of the underworld. To begin with, he is accused of a certain levity in regard to the gods. He stole their secrets. Egina, the daughter of Esopus, was carried off by Jupiter. The father was shocked by that disappearance and complained to Sisyphus. He, who knew of the abduction, offered to tell about it on condition that Esopus would give water to the citadel of Corinth. To the celestial thunderbolts he preferred the benediction of water. He was punished for this in the underworld. Homer tells us also that Sisyphus had put Death in chains. Pluto could not endure the sight of his deserted, silent empire. He dispatched the god of war, who liberated Death from the hands of her conqueror. It is said that Sisyphus, being near to death, rashly wanted to test his wife\'s love. He ordered her to cast his unburied body into the middle of the public square. Sisyphus woke up in the underworld. And there, annoyed by an obedience so contrary to human love, he obtained from Pluto permission to return to earth in order to chastise his wife. But when he had seen again the face of this world, enjoyed water and sun, warm stones and the sea, he no longer wanted to go back to the infernal darkness. Recalls, signs of anger, warnings were of no avail. Many years more he lived facing the curve of the gulf, the sparkling sea, and the smiles of earth. A decree of the gods was necessary. 
Mercury came and seized the impudent man by the collar and, snatching him from his joys, lead him forcibly back to the underworld, where his rock was ready for him. You have already grasped that Sisyphus is the absurd hero. He is, as much through his passions as through his torture. His scorn of the gods, his hatred of death, and his passion for life won him that unspeakable penalty in which the whole being is exerted toward accomplishing nothing. This is the price that must be paid for the passions of this earth. Nothing is told us about Sisyphus in the underworld. Myths are made for the imagination to breathe life into them. As for this myth, one sees merely the whole effort of a body straining to raise the huge stone, to roll it, and push it up a slope a hundred times over; one sees the face screwed up, the cheek tight against the stone, the shoulder bracing the clay-covered mass, the foot wedging it, the fresh start with arms outstretched, the wholly human security of two earth-clotted hands. At the very end of his long effort measured by skyless space and time without depth, the purpose is achieved. Then Sisyphus watches the stone rush down in a few moments toward tlower world whence he will have to push it up again toward the summit. He goes back down to the plain. It is during that return, that pause, that Sisyphus interests me. A face that toils so close to stones is already stone itself! I see that man going back down with a heavy yet measured step toward the torment of which he will never know the end. That hour like a breathing-space which returns as surely as his suffering, that is the hour of consciousness. At each of those moments when he leaves the heights and gradually sinks toward the lairs of the gods, he is superior to his fate. He is stronger than his rock. If this myth is tragic, that is because its hero is conscious. Where would his torture be, indeed, if at every step the hope of succeeding upheld him? 
The workman of today works everyday in his life at the same tasks, and his fate is no less absurd. But it is tragic only at the rare moments when it becomes conscious. Sisyphus, proletarian of the gods, powerless and rebellious, knows the whole extent of his wretched condition: it is what he thinks of during his descent. The lucidity that was to constitute his torture at the same time crowns his victory. There is no fate that can not be surmounted by scorn. If the descent is thus sometimes performed in sorrow, it can also take place in joy. This word is not too much. Again I fancy Sisyphus returning toward his rock, and the sorrow was in the beginning. When the images of earth cling too tightly to memory, when the call of happiness becomes too insistent, it happens that melancholy arises in man\'s heart: this is the rock\'s victory, this is the rock itself. The boundless grief is too heavy to bear. These are our nights of Gethsemane. But crushing truths perish from being acknowledged. Thus, Edipus at the outset obeys fate without knowing it. But from the moment he knows, his tragedy begins. Yet at the same moment, blind and desperate, he realizes that the only bond linking him to the world is the cool hand of a girl. Then a tremendous remark rings out: "Despite so many ordeals, my advanced age and the nobility of my soul make me conclude that all is well." Sophocles\' Edipus, like Dostoevsky\'s Kirilov, thus gives the recipe for the absurd victory. Ancient wisdom confirms modern heroism. One does not discover the absurd without being tempted to write a manual of happiness. "What!---by such narrow ways--?" There is but one world, however. Happiness and the absurd are two sons of the same earth. They are inseparable. It would be a mistake to say that happiness necessarily springs from the absurd. Discovery. It happens as well that the felling of the absurd springs from happiness. "I conclude that all is well," says Edipus, and that remark is sacred. 
It echoes in the wild and limited universe of man. It teaches that all is not, has not been, exhausted. It drives out of this world a god who had come into it with dissatisfaction and a preference for futile suffering. It makes of fate a human matter, which must be settled among men. All Sisyphus\' silent joy is contained therein. His fate belongs to him. His rock is a thing. Likewise, the absurd man, when he contemplates his torment, silences all the idols. In the universe suddenly restored to its silence, the myriad wondering little voices of the earth rise up. Unconscious, secret calls, invitations from all the faces, they are the necessary reverse and price of victory. There is no sun without shadow, and it is essential to know the night. The absurd man says yes and his efforts will henceforth be unceasing. If there is a personal fate, there is no higher destiny, or at least there is, but one which he concludes is inevitable and despicable. For the rest, he knows himself to be the master of his days. At that subtle moment when man glances backward over his life, Sisyphus returning toward his rock, in that slight pivoting he contemplates that series of unrelated actions which become his fate, created by him, combined under his memory\'s eye and soon sealed by his death. Thus, convinced of the wholly human origin of all that is human, a blind man eager to see who knows that the night has no end, he is still on the go. The rock is still rolling. I leave Sisyphus at the foot of the mountain! One always finds one\'s burden again. But Sisyphus teaches the higher fidelity that negates the gods and raises rocks. He too concludes that all is well. This universe henceforth without a master seems to him neither sterile nor futile. Each atom of that stone, each mineral flake of that night filled mountain, in itself forms a world. The struggle itself toward the heights is enough to fill a man\'s heart. One must imagine Sisyphus happy.\n'

# Import Textatistic
from textatistic import Textatistic

# Compute the readability scores
readability_scores = Textatistic(sisyphus_essay).scores

# Print the flesch reading ease score
flesch = readability_scores['flesch_score']
print("The Flesch Reading Ease is %.2f" % (flesch))

#################################################
#You now know how to compute the Flesch reading ease score for a
#given body of text. Notice that the score for this essay is
#approximately 81.67. This indicates that the essay is at the
#readability level of a 6th grade American student.

The Flesch Reading Ease is 81.67
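
To turn a Flesch score into a reading level, a small lookup over the commonly cited score bands is enough. The helper below is a hypothetical sketch; the band boundaries follow the usual Flesch interpretation table and are not something textatistic provides.

```python
def flesch_grade_level(score):
    """Map a Flesch reading ease score to an approximate US reading level."""
    bands = [
        (90, "5th grade"),
        (80, "6th grade"),
        (70, "7th grade"),
        (60, "8th to 9th grade"),
        (50, "10th to 12th grade"),
        (30, "college"),
    ]
    # Return the first band whose lower bound the score reaches
    for lower_bound, level in bands:
        if score >= lower_bound:
            return level
    return "college graduate"

print(flesch_grade_level(81.67))
```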


In [2]:
#Readability of various publications

#In this exercise, you have been given excerpts of articles from
#four publications. Your task is to compute the readability of these
#excerpts using the Gunning fog index and consequently, determine
#the relative difficulty of reading these publications.

#The excerpts are available as the following strings:

#forbes- An excerpt from an article from Forbes magazine on the
#Chinese social credit score system.

#harvard_law- An excerpt from a book review published in Harvard
#Law Review.

#r_digest- An excerpt from a Reader's Digest article on flight
#turbulence.

#time_kids - An excerpt from an article on the ill effects of salt
#consumption published in TIME for Kids.

forbes = '\nThe idea is to create more transparency about companies and individuals that are breaking the law or are non-compliant with official obligations and incentivize the right behaviors with the overall goal of improving governance and market order. The Chinese Communist Party intends the social credit score system to “allow the trustworthy to roam freely under heaven while making it hard for the discredited to take a single step.” Even though the system is still under development it currently plays out in real life in myriad ways for private citizens, businesses and government officials. Generally, higher credit scores give people a variety of advantages. Individuals are often given perks such as discounted energy bills and access or better visibility on dating websites. Often, those with higher social credit scores are able to forgo deposits on rental properties, bicycles, and umbrellas. They can even get better travel deals. In addition, Chinese hospitals are currently experimenting with social credit scores. A social credit score above 650 at one hospital allows an individual to see a doctor without lining up to pay.\n'
harvard_law = '\nIn his important new book, The Schoolhouse Gate: Public Education, the Supreme Court, and the Battle for the American Mind, Professor Justin Driver reminds us that private controversies that arise within the confines of public schools are part of a broader historical arc — one that tracks a range of cultural and intellectual flashpoints in U.S. history. Moreover, Driver explains, these tensions are reflected in constitutional law, and indeed in the history and jurisprudence of the Supreme Court. As such, debates that arise in the context of public education are not simply about the conflict between academic freedom, public safety, and student rights. They mirror our persistent struggle to reconcile our interest in fostering a pluralistic society, rooted in the ideal of individual autonomy, with our desire to cultivate a sense of national unity and shared identity (or, put differently, our effort to reconcile our desire to forge common norms of citizenship with our fear of state indoctrination and overencroachment). In this regard, these debates reflect the unique role that both the school and the courts have played in defining and enforcing the boundaries of American citizenship. \n'
r_digest = '\nThis week 30 passengers were reportedly injured when a Turkish Airlines flight landing at John F. Kennedy International Airport encountered turbulent conditions. Injuries included bruises, bloody noses, and broken bones. In mid-February, a Delta Airlines flight made an emergency landing to assist three passengers in getting to the nearest hospital after some sudden and unexpected turbulence. Doctors treated 15 passengers after a flight from Miami to Buenos Aires last October for everything from severe bruising to nosebleeds after the plane caught some rough winds over Brazil. In 2016, 23 passengers were injured on a United Airlines flight after severe turbulence threw people into the cabin ceiling. The list goes on. Turbulence has been become increasingly common, with painful outcomes for those on board. And more costly to the airlines, too. Forbes estimates that the cost of turbulence has risen to over $500 million each year in damages and delays. And there are no signs the increase in turbulence will be stopping anytime soon.\n'
time_kids = '\nThat, of course, is easier said than done. The more you eat salty foods, the more you develop a taste for them. The key to changing your diet is to start small. “Small changes in sodium in foods are not usually noticed,” Quader says. Eventually, she adds, the effort will reset a kid’s taste buds so the salt cravings stop. Bridget Murphy is a dietitian at New York University’s Langone Medical Center. She suggests kids try adding spices to their food instead of salt. Eating fruits and veggies and cutting back on packaged foods will also help. Need a little inspiration? Murphy offers this tip: Focus on the immediate effects of a diet that is high in sodium. High blood pressure can make it difficult to be active. “Do you want to be able to think clearly and perform well in school?” she asks. “If you’re an athlete, do you want to run faster?” If you answered yes to these questions, then it’s time to shake the salt habit.\n'


# Import Textatistic
from textatistic import Textatistic

# List of excerpts
excerpts = [forbes, harvard_law, r_digest, time_kids]

# Loop through excerpts and compute gunning fog index
gunning_fog_scores = []
for excerpt in excerpts:
    readability_scores = Textatistic(excerpt).scores
    gunning_fog = readability_scores['gunningfog_score']
    gunning_fog_scores.append(gunning_fog)

# Print the gunning fog indices
print(gunning_fog_scores)

#################################################
#You are now adept at computing readability scores for various
#pieces of text. Notice that the Harvard Law Review excerpt has
#the highest Gunning fog index, indicating that it can be
#comprehended only by readers who have graduated from college. On
#the other hand, the TIME for Kids article, intended for children,
#has a much lower fog index and can be comprehended by 5th grade
#students.

[14.436002482929858, 20.735401069518716, 11.085587583148559, 5.926785009861934]
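
The printed list is easier to read when each score is paired with its publication. A small sketch, reusing the scores printed above (the `publications` list of names is added here for illustration):

```python
publications = ["forbes", "harvard_law", "r_digest", "time_kids"]
gunning_fog_scores = [14.436002482929858, 20.735401069518716,
                      11.085587583148559, 5.926785009861934]

# Pair each publication with its score and sort from hardest to easiest
ranked = sorted(zip(publications, gunning_fog_scores),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:<12} {score:5.2f}")
```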


**Tokenization and Lemmatization**
___
- making text machine friendly
    - text preprocessing techniques
        - converting words into lowercase
        - removing leading and trailing whitespaces
        - removing punctuation
        - removing stopwords
        - expanding contractions
        - removing special characters (numbers, emojis, etc.)
- tokenization
- lemmatization
    - convert word into its base form
        - am, is, are -> be
        - reducing, reduces, reduced, reduction -> reduce
        - n't -> not
        - 've -> have
- both tokenization and lemmatization can be done using spaCy library
___
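
A few of the preprocessing techniques listed above need nothing beyond the standard library. The sketch below is a minimal illustration, assuming a deliberately tiny, hypothetical contraction map; a real pipeline would use a fuller mapping or leave contractions to the spaCy tokenizer and lemmatizer.

```python
import string

# Hypothetical, deliberately tiny contraction map for illustration only
CONTRACTIONS = {
    "can't": "cannot",
    "we're": "we are",
    "we've": "we have",
    "it's": "it is",
}

def simple_preprocess(text):
    """Strip whitespace, lowercase, expand contractions, drop punctuation."""
    text = text.strip().lower()
    # Expand contractions word by word (misses ones followed by punctuation)
    words = [CONTRACTIONS.get(word, word) for word in text.split()]
    text = " ".join(words)
    # Remove ASCII punctuation characters
    return text.translate(str.maketrans("", "", string.punctuation))

print(simple_preprocess("  We're met here, but we can't dedicate this ground.  "))
```

The order matters: contractions are expanded before punctuation is removed, otherwise the apostrophes needed for the lookup would already be gone.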

In [3]:
#Tokenizing the Gettysburg Address

#In this exercise, you will be tokenizing one of the most famous
#speeches of all time: the Gettysburg Address delivered by American
#President Abraham Lincoln during the American Civil War.

#The entire speech is available as a string named gettysburg.

gettysburg = "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It's rather for us to be here dedicated to the great task remaining before us - that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that we here highly resolve that these dead shall not have died in vain - that this nation, under God, shall have a new birth of freedom - and that government of the people, by the people, for the people, shall not perish from the earth."

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate the tokens
tokens = [token.text for token in doc]
print(tokens)

#################################################
#You now know how to tokenize a piece of text. In the next exercise,
#we will perform similar steps and conduct lemmatization.

['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'Liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', 'all', 'men', 'are', 'created', 'equal', '.', 'Now', 'we', "'re", 'engaged', 'in', 'a', 'great', 'civil', 'war', ',', 'testing', 'whether', 'that', 'nation', ',', 'or', 'any', 'nation', 'so', 'conceived', 'and', 'so', 'dedicated', ',', 'can', 'long', 'endure', '.', 'We', "'re", 'met', 'on', 'a', 'great', 'battlefield', 'of', 'that', 'war', '.', 'We', "'ve", 'come', 'to', 'dedicate', 'a', 'portion', 'of', 'that', 'field', ',', 'as', 'a', 'final', 'resting', 'place', 'for', 'those', 'who', 'here', 'gave', 'their', 'lives', 'that', 'that', 'nation', 'might', 'live', '.', 'It', "'s", 'altogether', 'fitting', 'and', 'proper', 'that', 'we', 'should', 'do', 'this', '.', 'But', ',', 'in', 'a', 'larger', 'sense', ',', 'we', 'ca', "n't", 'dedicate', '-', 'we', '

In [5]:
#Lemmatizing the Gettysburg Address

#In this exercise, we will perform lemmatization on the same
#Gettysburg Address from before.

#However, this time, we will also take a look at the speech, before
#and after lemmatization, and try to judge the kind of changes
#that take place to make the piece more machine friendly.

gettysburg = "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It's rather for us to be here dedicated to the great task remaining before us - that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that we here highly resolve that these dead shall not have died in vain - that this nation, under God, shall have a new birth of freedom - and that government of the people, by the people, for the people, shall not perish from the earth."

# Print the gettysburg address
print(gettysburg)

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate lemmas
lemmas = [token.lemma_ for token in doc]

# Convert lemmas into a string
print(' '.join(lemmas))

#################################################
#You're now proficient at performing lemmatization using spaCy.
#Observe the lemmatized version of the speech. It isn't very
#readable to humans but it is in a much more convenient format for
#a machine to process.

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so no

**Text cleaning**
___
- text cleaning techniques
    - unnecessary whitespaces and escape sequences
    - punctuation
    - special characters (numbers, emojis, etc.)
    - stopwords
- .isalpha() method
    - boolean
    - remove numbers, punctuation, emojis
    - caution: abbreviations containing periods (e.g., U.S.) and proper nouns containing digits (e.g., word2vec) would also be removed
- stopwords
    - words that occur extremely commonly
    - e.g. articles, be verbs, pronouns, etc.
- other text preprocessing techniques
    - removing html/xml tags
    - replacing accented characters
    - correcting spelling errors
- a word of caution
    - always use only those text preprocessing techniques that are relevant to your application
___
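
Two of the other preprocessing techniques mentioned above, removing HTML/XML tags and replacing accented characters, can be sketched with the standard library. Both helpers are hypothetical illustrations: the tag regex is naive and is not a substitute for a real HTML parser on messy markup.

```python
import re
import unicodedata

def strip_tags(text):
    """Remove anything that looks like an HTML/XML tag (naive regex)."""
    return re.sub(r"<[^>]+>", "", text)

def strip_accents(text):
    """Replace accented characters with their unaccented base characters."""
    # NFKD splits each accented character into a base character plus marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

raw = "<p>The café served <b>crème brûlée</b></p>"
print(strip_accents(strip_tags(raw)))
```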

In [6]:
#Cleaning a blog post

#In this exercise, you have been given an excerpt from a blog post.
#Your task is to clean this text into a more machine friendly
#format. This will involve converting to lowercase, lemmatization
#and removing stopwords, punctuation and non-alphabetic characters.

#The excerpt is available as a string blog and has been printed
#to the console. The list of stopwords is available as stopwords.

blog = '\nTwenty-first-century politics has witnessed an alarming rise of populism in the U.S. and Europe. The first warning signs came with the UK Brexit Referendum vote in 2016 swinging in the way of Leave. This was followed by a stupendous victory by billionaire Donald Trump to become the 45th President of the United States in November 2016. Since then, Europe has seen a steady rise in populist and far-right parties that have capitalized on Europe’s Immigration Crisis to raise nationalist and anti-Europe sentiments. Some instances include Alternative for Germany (AfD) winning 12.6% of all seats and entering the Bundestag, thus upsetting Germany’s political order for the first time since the Second World War, the success of the Five Star Movement in Italy and the surge in popularity of neo-nazism and neo-fascism in countries such as Hungary, Czech Republic, Poland and Austria.\n'
stopwords = ['fifteen', 'noone', 'whereupon', 'could', 'ten', 'all', 'please', 'indeed', 'whole', 'beside', 'therein', 'using', 'but', 'very', 'already', 'about', 'no', 'regarding', 'afterwards', 'front', 'go', 'in', 'make', 'three', 'here', 'what', 'without', 'yourselves', 'which', 'nothing', 'am', 'between', 'along', 'herein', 'sometimes', 'did', 'as', 'within', 'elsewhere', 'was', 'forty', 'becoming', 'how', 'will', 'other', 'bottom', 'these', 'amount', 'across', 'the', 'than', 'first', 'namely', 'may', 'none', 'anyway', 'again', 'eleven', 'his', 'meanwhile', 'name', 're', 'from', 'some', 'thru', 'upon', 'whither', 'he', 'such', 'down', 'my', 'often', 'whether', 'made', 'while', 'empty', 'two', 'latter', 'whatever', 'cannot', 'less', 'many', 'you', 'ours', 'done', 'thus', 'since', 'everything', 'for', 'more', 'unless', 'former', 'anyone', 'per', 'seeming', 'hereafter', 'on', 'yours', 'always', 'due', 'last', 'alone', 'one', 'something', 'twenty', 'until', 'latterly', 'seems', 'were', 'where', 'eight', 'ourselves', 'further', 'themselves', 'therefore', 'they', 'whenever', 'after', 'among', 'when', 'at', 'through', 'put', 'thereby', 'then', 'should', 'formerly', 'third', 'who', 'this', 'neither', 'others', 'twelve', 'also', 'else', 'seemed', 'has', 'ever', 'someone', 'its', 'that', 'does', 'sixty', 'why', 'do', 'whereas', 'are', 'either', 'hereupon', 'rather', 'because', 'might', 'those', 'via', 'hence', 'itself', 'show', 'perhaps', 'various', 'during', 'otherwise', 'thereafter', 'yourself', 'become', 'now', 'same', 'enough', 'been', 'take', 'their', 'seem', 'there', 'next', 'above', 'mostly', 'once', 'a', 'top', 'almost', 'six', 'every', 'nobody', 'any', 'say', 'each', 'them', 'must', 'she', 'throughout', 'whence', 'hundred', 'not', 'however', 'together', 'several', 'myself', 'i', 'anything', 'somehow', 'or', 'used', 'keep', 'much', 'thereupon', 'ca', 'just', 'behind', 'can', 'becomes', 'me', 'had', 'only', 'back', 'four', 'somewhere', 'if', 'by', 'whereafter', 
'everywhere', 'beforehand', 'well', 'doing', 'everyone', 'nor', 'five', 'wherein', 'so', 'amongst', 'though', 'still', 'move', 'except', 'see', 'us', 'your', 'against', 'although', 'is', 'became', 'call', 'have', 'most', 'wherever', 'few', 'out', 'whom', 'yet', 'be', 'own', 'off', 'quite', 'with', 'and', 'side', 'whoever', 'would', 'both', 'fifty', 'before', 'full', 'get', 'sometime', 'beyond', 'part', 'least', 'besides', 'around', 'even', 'whose', 'hereby', 'up', 'being', 'we', 'an', 'him', 'below', 'moreover', 'really', 'it', 'of', 'our', 'nowhere', 'whereby', 'too', 'her', 'toward', 'anyhow', 'give', 'never', 'another', 'anywhere', 'mine', 'herself', 'over', 'himself', 'to', 'onto', 'into', 'thence', 'towards', 'hers', 'nevertheless', 'serious', 'under', 'nine']

import spacy

# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(blog)

# Generate lemmatized tokens, lowercased so they match the lowercase stopword list
lemmas = [token.lemma_.lower() for token in doc]

# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]

# Print string after text cleaning
print(' '.join(a_lemmas))

#################################################
#Take a look at the cleaned text; it is lowercased and devoid of
#numbers, punctuation and commonly used stopwords. Also, note
#that the word U.S. was present in the original text. Since it
#contains periods, isalpha() returns False for it and our text
#cleaning process removed it completely. This may not be ideal
#behavior. It is advisable to write your own custom functions in
#place of isalpha() for more nuanced cases.
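
One way to keep abbreviations such as U.S. is to swap the isalpha() check for a small custom predicate. The sketch below is only an illustration: the helper name `is_word` and the abbreviation pattern are assumptions, not part of the exercise, and it works on plain strings so it can be tried without loading a spaCy model.

```python
import re

# Hypothetical pattern: two or more "letter + period" groups, e.g. "U.S." or "e.g."
ABBREVIATION = re.compile(r"^(?:[A-Za-z]\.){2,}$")

def is_word(token):
    """Accept purely alphabetic tokens and dotted abbreviations."""
    return token.isalpha() or bool(ABBREVIATION.match(token))

tokens = ["U.S.", "Europe", "2016", "12.6", ",", "populism", "e.g."]
kept = [t for t in tokens if is_word(t)]
print(kept)  # ['U.S.', 'Europe', 'populism', 'e.g.']
```

In the cleaning step above, `is_word(lemma)` would take the place of `lemma.isalpha()`; whether abbreviations should survive depends on the application.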



In [None]:
#Cleaning TED talks in a dataframe

#In this exercise, we will revisit the TED Talks from the first
#chapter. You have been given a dataframe ted consisting of 5
#TED Talks. Your task is to clean these talks using techniques
#discussed earlier by writing a function preprocess and applying
#it to the transcript feature of the dataframe.

#The stopwords list is available as stopwords.

# Function to preprocess text
#def preprocess(text):
    # Create Doc object
#    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
#    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
#    a_lemmas = [lemma for lemma in lemmas
#            if lemma.isalpha() and lemma not in stopwords]

#    return ' '.join(a_lemmas)

# Apply preprocess to ted['transcript']
#ted['transcript'] = ted['transcript'].apply(preprocess)
#print(ted['transcript'])

#################################################
#<script.py> output:
#    0     talk new lecture ted illusion create ted try r...
#    1     representation brain brain break left half log...
#    2     great honor today share digital universe creat...
#    3     passion music technology thing combination thi...
#    4     use want computer new program programming requ...
#    5     neuroscientist mixed background physics medici...
#    6     pat mitchell day january begin like work love ...
#    7     taylor wilson year old nuclear physicist littl...
#    8     grow northern ireland right north end absolute...
#    9     publish article new york times modern love col...
#    10    joseph member parliament kenya picture maasai ...
#    11    hi talk little bit music machine life specific...
#    12    hi let ask audience question lie child raise h...
#    13    historical record allow know ancient greeks dr...
#    14    good morning little boy experience change life...
#    15    slide year ago time short slide morning time w...
#    16    like world like share year old love story poor...
#    17    fail woman fail feminist passionate opinion ge...
#    18    revolution century significant longevity revol...
#    19    today baffle lady observe shell soul dwellsand...
#    Name: transcript, dtype: object
#################################################
#You have preprocessed all the TED talk transcripts contained in
#ted, and they are now in good shape for operations such as
#vectorization, which we will cover soon. You now have a good
#understanding of how text preprocessing works and why it is
#important. In the next lessons, we will move on to generating
#word level features for our texts.

**Part-of-speech tagging**
___
- Applications
    - word-sense disambiguation
        - "the bear is a majestic animal"
        - "please bear with me"
    - sentiment analysis
    - question answering
    - fake news and opinion spam detection
- POS tagging
    - Assigning every word its corresponding part of speech
        - "Jane is an amazing guitarist"
            - Jane -> proper noun
            - is -> verb
            - an -> determiner
            - amazing -> adjective
            - guitarist -> noun
___

In [7]:
#POS tagging in Lord of the Flies

#In this exercise, you will perform part-of-speech tagging on a
#famous passage from one of the most well-known novels of all time,
#Lord of the Flies, authored by William Golding.

#The passage is available as lotf and has already been printed to
#the console.

lotf = 'He found himself understanding the wearisomeness of this life, where every path was an improvisation and a considerable part of one’s waking life was spent watching one’s feet.'

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(lotf)

# Generate tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

#################################################
#Examine the various POS tags attached to each token and evaluate
#if they make intuitive sense to you. Most tokens are labelled
#correctly according to the standard rules of English grammar,
#although the model is not perfect: the two occurrences of 'one'
#receive different tags (NOUN and PRON).

[('He', 'PRON'), ('found', 'VERB'), ('himself', 'PRON'), ('understanding', 'VERB'), ('the', 'DET'), ('wearisomeness', 'NOUN'), ('of', 'ADP'), ('this', 'DET'), ('life', 'NOUN'), (',', 'PUNCT'), ('where', 'ADV'), ('every', 'DET'), ('path', 'NOUN'), ('was', 'AUX'), ('an', 'DET'), ('improvisation', 'NOUN'), ('and', 'CCONJ'), ('a', 'DET'), ('considerable', 'ADJ'), ('part', 'NOUN'), ('of', 'ADP'), ('one', 'NOUN'), ('’s', 'PART'), ('waking', 'VERB'), ('life', 'NOUN'), ('was', 'AUX'), ('spent', 'VERB'), ('watching', 'VERB'), ('one', 'PRON'), ('’s', 'PART'), ('feet', 'NOUN'), ('.', 'PUNCT')]


In [None]:
#Counting nouns in a piece of text

#In this exercise, we will write two functions, nouns() and
#proper_nouns() that will count the number of other nouns and
#proper nouns in a piece of text respectively.

#These functions will take in a piece of text and generate a list
#containing the POS tags for each word. They will then return the
#number of proper nouns/other nouns that the text contains. We will
#use these functions in the next exercise to generate interesting
#insights about fake news.

#The en_core_web_sm model has already been loaded as nlp in this
#exercise.

#nlp = spacy.load('en_core_web_sm')

# Returns number of proper nouns
#def proper_nouns(text, model=nlp):
    # Create doc object
#    doc = model(text)
    # Generate list of POS tags
#    pos = [token.pos_ for token in doc]

    # Return number of proper nouns
#    return pos.count('PROPN')

# Returns number of other nouns
#def nouns(text, model=nlp):
    # Create doc object
#    doc = model(text)
    # Generate list of POS tags
#    pos = [token.pos_ for token in doc]

    # Return number of other nouns
#    return pos.count('NOUN')

#################################################
#You now know how to write functions that compute the number of
#instances of a particular POS tag in a given piece of text. In
#the next exercise, we will use these functions to generate
#features from text in a dataframe.
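
The two functions above differ only in the tag they count, so the same idea can be generalised with `collections.Counter`. This is a minimal sketch rather than the exercise's solution: the tag list is copied from the Lord of the Flies output shown earlier, so no spaCy model is needed, and the helper name `count_tag` is an assumption.

```python
from collections import Counter

# POS tags copied from the Lord of the Flies output shown earlier
tags = ['PRON', 'VERB', 'PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'DET',
        'NOUN', 'PUNCT', 'ADV', 'DET', 'NOUN', 'AUX', 'DET', 'NOUN',
        'CCONJ', 'DET', 'ADJ', 'NOUN', 'ADP', 'NOUN', 'PART', 'VERB',
        'NOUN', 'AUX', 'VERB', 'VERB', 'PRON', 'PART', 'NOUN', 'PUNCT']

# Count every tag in a single pass
tag_counts = Counter(tags)

def count_tag(tag, counts=tag_counts):
    """Return how many tokens carry the given POS tag (0 if the tag is absent)."""
    return counts[tag]

print(count_tag('NOUN'), count_tag('PROPN'))  # 8 0
```

With a real Doc object, `tags` would be `[token.pos_ for token in doc]`, exactly as in the functions above.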

In [None]:
#Noun usage in fake news

#In this exercise, you have been given a dataframe headlines that
#contains news headlines that are either fake or real. Your task
#is to generate two new features num_propn and num_noun that
#represent the number of proper nouns and other nouns contained
#in the title feature of headlines.

#Next, we will compute the mean number of proper nouns and other
#nouns used in fake and real news headlines and compare the values.
#If the difference is substantial, there is a good chance that
#using the num_propn and num_noun features in a fake news
#detector will improve its performance.

#To accomplish this task, the functions proper_nouns and nouns
#that you had built in the previous exercise have already been
#made available to you.

#headlines['num_propn'] = headlines['title'].apply(proper_nouns)

# Compute mean of proper nouns
#real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()
#fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()

# Print results
#print("Mean no. of proper nouns in real and fake headlines are %.2f and %.2f respectively"%(real_propn, fake_propn))
#################################################
#<script.py> output:
#    Mean no. of proper nouns in real and fake headlines are 2.46
#    and 4.86 respectively
#################################################

#headlines['num_noun'] = headlines['title'].apply(nouns)

# Compute mean of other nouns
#real_noun = headlines[headlines['label'] == 'REAL']['num_noun'].mean()
#fake_noun = headlines[headlines['label'] == 'FAKE']['num_noun'].mean()

# Print results
#print("Mean no. of other nouns in real and fake headlines are %.2f and %.2f respectively"%(real_noun, fake_noun))
#################################################
#<script.py> output:
#    Mean no. of other nouns in real and fake headlines are 2.30
#    and 1.44 respectively
#################################################
#You now know how to construct features using POS tag information.
#Notice how the mean number of proper nouns is considerably higher
#for fake news than it is for real news. The opposite seems to be
#true in the case of other nouns. This fact can be put to great
#use in designing fake news detectors.

**Named entity recognition**
___
- Applications
    - efficient search algorithms
    - question answering
    - news article classification
    - customer service
- named entity recognition
    - identifying and classifying named entities into predefined categories
    - categories include person, organization, country, etc.
- a word of caution
    - spaCy's models are not perfect
    - performance is dependent on training and test data
    - train models with specialized data for nuanced cases
    - language specific
___

In [8]:
#Named entities in a sentence

#In this exercise, we will identify and classify the labels of
#various named entities in a body of text using one of spaCy's
#statistical models. We will also verify the veracity of these
#labels.

import spacy

# Load the required model
nlp = spacy.load('en_core_web_sm')

# Create a Doc instance
text = 'Sundar Pichai is the CEO of Google. Its headquarters is in Mountain View.'
doc = nlp(text)

# Print all named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

#################################################
#Notice how the model correctly predicted the labels of all three
#entities: Sundar Pichai, Google and Mountain View. (An earlier
#version of en_core_web_sm mislabeled Sundar Pichai as an
#organization; the version used here labels him as a PERSON.) As
#discussed in the video, the predictions of the model depend
#strongly on the data it is trained on. It is possible to train
#spaCy models on your custom data. You will learn to do this in
#more advanced NLP courses.

Sundar Pichai PERSON
Google ORG
Mountain View GPE


In [9]:
#Identifying people mentioned in a news article

#In this exercise, you have been given an excerpt from a news
#article published in TechCrunch. Your task is to write a
#function find_persons that identifies the names of people that
#have been mentioned in a particular piece of text. You will then
#use find_persons to identify the people of interest in the article.

#The article is available as the string tc and has been printed
#to the console. The required spacy model has also been already
#loaded as nlp.

import spacy

# Load the required model
nlp = spacy.load('en_core_web_sm')

tc = "\nIt’s' been a busy day for Facebook  exec op-eds. Earlier this morning, Sheryl Sandberg broke the site’s silence around the Christchurch massacre, and now Mark Zuckerberg is calling on governments and other bodies to increase regulation around the sorts of data Facebook traffics in. He’s hoping to get out in front of heavy-handed regulation and get a seat at the table shaping it.\n"

def find_persons(text):
  # Create Doc object
  doc = nlp(text)

  # Identify the persons
  persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

  # Return persons
  return persons

print(find_persons(tc))

#################################################
#The article was related to Facebook and our function correctly
#identified both the people mentioned. You can now see how NER
#could be used in a variety of applications. Publishers may use
#a technique like this to classify news articles by the people
#mentioned in them. A question answering system could also use
#something like this to answer questions such as 'Who are the
#people mentioned in this passage?'. With this, we come to an
#end of this chapter. In the next, we will learn how to conduct
#vectorization on documents.

['Sheryl Sandberg', 'Mark Zuckerberg']
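
The same filtering idea extends to every entity label at once. The sketch below groups entities by label; it is an illustration, not part of the exercise. The helper name `group_by_label` is an assumption, and the (text, label) pairs are copied from the Sundar Pichai output above so the example runs without a spaCy model.

```python
from collections import defaultdict

# (text, label) pairs copied from the Sundar Pichai example output above
entities = [('Sundar Pichai', 'PERSON'),
            ('Google', 'ORG'),
            ('Mountain View', 'GPE')]

def group_by_label(ents):
    """Collect entity texts under their predicted label."""
    grouped = defaultdict(list)
    for text, label in ents:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
print(by_label['PERSON'])  # ['Sundar Pichai']
```

With a real Doc object, the input would be `[(ent.text, ent.label_) for ent in doc.ents]`.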


**Building a bag of words model**
___
- bag of words model
    - extract word tokens
    - compute frequency of word tokens
    - construct a word vector of these frequencies and vocabulary of corpus
___
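
The three steps above can be sketched in plain Python. This is a simplified illustration of the idea, not how scikit-learn is implemented: the `tokenize` helper is an assumption, chosen to mimic CountVectorizer's default behaviour of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

corpus = ['The lion is the king of the jungle',
          'Lions have lifespans of a decade',
          'The lion is an endangered species']

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

# Steps 1 and 2: extract word tokens and compute their frequencies per document
counts = [Counter(tokenize(doc)) for doc in corpus]

# Step 3: build the vocabulary and one frequency vector per document
vocabulary = sorted(set().union(*counts))
bow = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in bow:
    print(row)
```

Each row has one entry per vocabulary word, so 'the' contributes a 3 in the first sentence, while the single-character word 'a' is dropped by the tokenizer.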

In [None]:
#BoW model for movie taglines

#In this exercise, you have been provided with a corpus of more
#than 7000 movie tag lines. Your job is to generate the bag of
#words representation bow_matrix for these taglines. For this
#exercise, we will ignore the text preprocessing step and generate
#bow_matrix directly.

#We will also investigate the shape of the resultant bow_matrix.
#The first five taglines in corpus have been printed to the
#console for you to examine.

# Import CountVectorizer
#from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
#vectorizer = CountVectorizer()

# Generate matrix of word vectors
#bow_matrix = vectorizer.fit_transform(corpus)

# Print the shape of bow_matrix
#print(bow_matrix.shape)

#################################################
#<script.py> output:
#    (7033, 6614)
#################################################
#You now know how to generate a bag of words representation for a
#given corpus of documents. Notice that the word vectors created
#have more than 6600 dimensions. However, most of these dimensions
#have a value of zero since most words do not occur in a particular
#tagline.

In [None]:
#Analyzing dimensionality and preprocessing

#In this exercise, you have been provided with a lem_corpus which
#contains the pre-processed versions of the movie taglines from
#the previous exercise. In other words, the taglines have been
#lowercased and lemmatized, and stopwords have been removed.

#Your job is to generate the bag of words representation
#bow_lem_matrix for these lemmatized taglines and compare its
#shape with that of bow_matrix obtained in the previous exercise.
#The first five lemmatized taglines in lem_corpus have been printed
#to the console for you to examine.

# Import CountVectorizer
#from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
#vectorizer = CountVectorizer()

# Generate matrix of word vectors
#bow_lem_matrix = vectorizer.fit_transform(lem_corpus)

# Print the shape of bow_lem_matrix
#print(bow_lem_matrix.shape)

#################################################
#<script.py> output:
#    (6959, 5223)
#################################################
#Notice how the number of features has reduced significantly,
#from around 6600 to around 5200, for the pre-processed movie
#taglines. The reduced number of dimensions on account of text
#preprocessing often leads to better performance when conducting
#machine learning, so it is a good idea to consider it. However, as
#mentioned in a previous lesson, the final decision always depends
#on the nature of the application.
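
The effect of preprocessing on dimensionality can be seen on a tiny scale. The sketch below is an assumption-laden illustration: it uses three short sentences instead of the taglines, a handful of words taken from the stopwords list used earlier, and a hypothetical `vocab` helper. Lemmatization is left out to keep it dependency-free.

```python
import re

corpus = ['The lion is the king of the jungle',
          'Lions have lifespans of a decade',
          'The lion is an endangered species']

# A handful of entries taken from the stopwords list used earlier
stop = {'the', 'is', 'of', 'a', 'an', 'have'}

def vocab(docs, stopwords=frozenset()):
    """Return the set of distinct lowercase words, minus any stopwords."""
    words = set()
    for doc in docs:
        words.update(w for w in re.findall(r'[a-z]+', doc.lower())
                     if w not in stopwords)
    return words

raw_vocab = vocab(corpus)
clean_vocab = vocab(corpus, stop)
print(len(raw_vocab), len(clean_vocab))  # 14 8
```

Lemmatization would shrink the vocabulary further by merging forms such as 'lion' and 'lions' into one feature.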

In [10]:
#Mapping feature indices with feature names

#In the lesson video, we had seen that CountVectorizer doesn't
#necessarily index the vocabulary in alphabetical order. In this
#exercise, we will learn to map each feature index to its
#corresponding feature name from the vocabulary.

#We will use the same three sentences on lions from the video.
#The sentences are available in a list named corpus, which has
#already been printed to the console.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The lion is the king of the jungle',
 'Lions have lifespans of a decade',
 'The lion is an endangered species']

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Convert bow_matrix into a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray())

# Map the column names to vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
bow_df.columns = vectorizer.get_feature_names_out()

# Print bow_df
print(bow_df)

#################################################
#Observe that the column names refer to the token whose frequency
#is being recorded. Therefore, since the first column name is an,
#the first feature represents the number of times the word 'an'
#occurs in a particular sentence. get_feature_names_out()
#essentially gives us an array which represents the mapping of the
#feature indices to the feature names in the vocabulary.

   an  decade  endangered  have  is  jungle  king  lifespans  lion  lions  of  \
0   0       0           0     0   1       1     1          0     1      0   1   
1   0       1           0     1   0       0     0          1     0      1   1   
2   1       0           1     0   1       0     0          0     1      0   0   

   species  the  
0        0    3  
1        0    0  
2        1    1  


**Building a BoW Naive Bayes classifier**
___
- Steps
    - text preprocessing
    - building a bag-of-words model (or representation)
    - machine learning
___

In [None]:
#BoW vectors for movie reviews

#In this exercise, you have been given two pandas Series, X_train
#and X_test, which consist of movie reviews. They represent the
#training and the test review data respectively. Your task is to
#preprocess the reviews and generate BoW vectors for these two
#sets using CountVectorizer.

#Once we have generated the BoW vector matrices X_train_bow and
#X_test_bow, we will be in a very good position to apply a machine
#learning model to it and conduct sentiment analysis.

# Import CountVectorizer
#from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object
#vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# Fit and transform X_train
#X_train_bow = vectorizer.fit_transform(X_train)

# Transform X_test
#X_test_bow = vectorizer.transform(X_test)

# Print shape of X_train_bow and X_test_bow
#print(X_train_bow.shape)
#print(X_test_bow.shape)

#################################################
#<script.py> output:
#    (250, 8158)
#    (750, 8158)
#################################################
#You now have a good idea of how to preprocess text and transform
#it into its bag-of-words representation using CountVectorizer.
#In this exercise, you have set the lowercase argument to True.
#However, note that this is the default value of lowercase and
#passing it explicitly is not necessary. Also, note that both
#X_train_bow and X_test_bow have 8158 features. There were words
#present in X_test that were not in X_train. Because the vocabulary
#is fixed when the vectorizer is fitted on X_train, transform()
#ignores them, which keeps the dimensions of both sets the same.
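
Why unseen test words are dropped becomes clearer with a hand-rolled version of fit and transform. This is a simplified sketch, not scikit-learn's implementation: `fit_vocabulary` and `transform` are hypothetical helpers and the two tiny review sets are made up for illustration.

```python
def fit_vocabulary(docs):
    """Map each word seen in the training documents to a column index."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(docs, vocab):
    """Count words per document using only the columns fixed at fit time."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for w in doc.lower().split():
            if w in vocab:  # words unseen during fitting are skipped
                row[vocab[w]] += 1
        rows.append(row)
    return rows

train = ['great movie', 'terrible movie']   # made-up training reviews
test = ['great soundtrack great cast']      # contains two unseen words

vocab = fit_vocabulary(train)
X_train = transform(train, vocab)
X_test = transform(test, vocab)
print(X_train)  # [[1, 1, 0], [0, 1, 1]]
print(X_test)   # [[2, 0, 0]]
```

Both matrices have three columns because the columns are defined by the training vocabulary alone; 'soundtrack' and 'cast' have no column to land in.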

In [None]:
#Predicting the sentiment of a movie review

#In the previous exercise, you generated the bag-of-words
#representations for the training and test movie review data. In
#this exercise, we will use this model to train a Naive Bayes
#classifier that can detect the sentiment of a movie review and
#compute its accuracy. Note that since this is a binary
#classification problem, the model is only capable of classifying
#a review as either positive (1) or negative (0). It is incapable
#of detecting neutral reviews.

#In case you don't recall, the training and test BoW vectors are
#available as X_train_bow and X_test_bow respectively. The
#corresponding labels are available as y_train and y_test
#respectively. Also, for your reference, the original movie
#review dataset is available as df.

# Import and create a MultinomialNB object
#from sklearn.naive_bayes import MultinomialNB
#clf = MultinomialNB()

# Fit the classifier
#clf.fit(X_train_bow, y_train)

# Measure the accuracy
#accuracy = clf.score(X_test_bow, y_test)
#print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
#review = "The movie was terrible. The music was underwhelming and the acting mediocre."
#prediction = clf.predict(vectorizer.transform([review]))[0]
#print("The sentiment predicted by the classifier is %i" % (prediction))

#################################################
#<script.py> output:
#    The accuracy of the classifier on the test set is 0.732
#    The sentiment predicted by the classifier is 0
#################################################
#You have successfully performed basic sentiment analysis. Note
#that the accuracy of the classifier is 73.2%. Considering the
#fact that it was trained on only 750 reviews, this is reasonably
#good performance. The classifier also correctly predicts the
#sentiment of a mini negative review which we passed into it.

**Building n-gram models**
___
![_images/19.3.PNG](_images/19.3.PNG)
- n-grams
    - contiguous sequence of n elements (or words) in a given document
    - n = 1 -> bag-of-words
    - captures more context
    - applications
        - sentence completion
        - spelling correction
        - machine translation correction
    - shortcomings
        - curse of dimensionality
        - higher order n-grams are rare
        - keep n small
___
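
A short plain-Python sketch shows what "contiguous sequence of n elements" means and why the feature count grows with n. The helper names `ngrams` and `ngram_range` are assumptions made for this illustration; the second mirrors the idea behind CountVectorizer's ngram_range argument without reproducing its implementation.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_range(tokens, lo, hi):
    """Return every n-gram for n from lo to hi inclusive."""
    grams = []
    for n in range(lo, hi + 1):
        grams.extend(ngrams(tokens, n))
    return grams

tokens = 'the lion is the king of the jungle'.split()

bigrams = ngrams(tokens, 2)
print(bigrams)

# Number of distinct features for increasing n-gram ranges
sizes = [len(set(ngram_range(tokens, 1, n))) for n in (1, 2, 3)]
print(sizes)  # [6, 13, 19]
```

Even for a single eight-word sentence, the number of distinct features roughly triples when going from unigrams to n-grams up to n = 3.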

In [None]:
#n-gram models for movie tag lines

#In this exercise, we have been provided with a corpus of more
#than 7000 movie tag lines. Our job is to generate n-gram models
#up to n equal to 1, n equal to 2 and n equal to 3 for this data
#and discover the number of features for each model.

#We will then compare the number of features generated for each
#model.

# Generate n-grams up to n=1
#vectorizer_ng1 = CountVectorizer(ngram_range=(1,1))
#ng1 = vectorizer_ng1.fit_transform(corpus)

# Generate n-grams up to n=2
#vectorizer_ng2 = CountVectorizer(ngram_range=(1,2))
#ng2 = vectorizer_ng2.fit_transform(corpus)

# Generate n-grams up to n=3
#vectorizer_ng3 = CountVectorizer(ngram_range=(1,3))
#ng3 = vectorizer_ng3.fit_transform(corpus)

# Print the number of features for each model
#print("ng1, ng2 and ng3 have %i, %i and %i features respectively" % (ng1.shape[1], ng2.shape[1], ng3.shape[1]))

#################################################
#<script.py> output:
#    ng1, ng2 and ng3 have 6614, 37100 and 76881 features respectively
#################################################
#You now know how to generate n-gram models containing higher
#order n-grams. Notice that ng2 has over 37,000 features whereas
#ng3 has over 76,000 features. This is much greater than the
#roughly 6,600 dimensions obtained for ng1. As the n-gram range
#increases, so does the number of features, leading to increased
#computational costs and a problem known as the curse of
#dimensionality.

In [None]:
#Higher order n-grams for sentiment analysis

#Similar to a previous exercise, we are going to build a
#classifier that can detect if the review of a particular movie
#is positive or negative. However, this time, we will use n-grams
#up to n=2 for the task.

#The n-gram training reviews are available as X_train_ng. The
#corresponding test reviews are available as X_test_ng. Finally,
#use y_train and y_test to access the training and test sentiment
#classes respectively.

# Import MultinomialNB and define an instance
#from sklearn.naive_bayes import MultinomialNB
#clf_ng = MultinomialNB()

# Fit the classifier
#clf_ng.fit(X_train_ng, y_train)

# Measure the accuracy
#accuracy = clf_ng.score(X_test_ng, y_test)
#print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
#review = "The movie was not good. The plot had several holes and the acting lacked panache."
#prediction = clf_ng.predict(ng_vectorizer.transform([review]))[0]
#print("The sentiment predicted by the classifier is %i" % (prediction))

#################################################
#<script.py> output:
#    The accuracy of the classifier on the test set is 0.758
#    The sentiment predicted by the classifier is 0
#################################################
#You're now adept at performing sentiment analysis using text.
#Notice how this classifier performs slightly better than the
#BoW version. Also, it succeeds at correctly identifying the
#sentiment of the mini-review as negative. In the next chapter,
#we will learn more complex methods of vectorizing textual data.

In [None]:
#Comparing performance of n-gram models

#You now know how to conduct sentiment analysis by converting
#text into various n-gram representations and feeding them to a
#classifier. In this exercise, we will conduct sentiment analysis
#for the same movie reviews from before using two n-gram models:
#unigrams and n-grams up to n equal to 3.

#We will then compare the performance using three criteria:
#accuracy of the model on the test set, time taken to execute
#the program and the number of features created when generating
#the n-gram representation.

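#import time
#from sklearn.model_selection import train_test_split
#from sklearn.feature_extraction.text import CountVectorizer
#from sklearn.naive_bayes import MultinomialNB

#start_time = time.time()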
# Splitting the data into training and test sets
#train_X, test_X, train_y, test_y = train_test_split(df['review'], df['sentiment'], test_size=0.5, random_state=42, stratify=df['sentiment'])

# Generating ngrams
#vectorizer = CountVectorizer(ngram_range=(1,1))
#train_X = vectorizer.fit_transform(train_X)
#test_X = vectorizer.transform(test_X)

# Fit classifier
#clf = MultinomialNB()
#clf.fit(train_X, train_y)

# Print accuracy, time and number of dimensions
#print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. The ngram representation had %i features." % (time.time() - start_time, clf.score(test_X, test_y), train_X.shape[1]))

#################################################
#<script.py> output:
#    The program took 0.168 seconds to complete. The accuracy on the test set is 0.75. The ngram representation had 12347 features.
#################################################

In [None]:
#start_time = time.time()
# Splitting the data into training and test sets
#train_X, test_X, train_y, test_y = train_test_split(df['review'], df['sentiment'], test_size=0.5, random_state=42, stratify=df['sentiment'])

# Generating ngrams
#vectorizer = CountVectorizer(ngram_range=(1,3))
#train_X = vectorizer.fit_transform(train_X)
#test_X = vectorizer.transform(test_X)

# Fit classifier
#clf = MultinomialNB()
#clf.fit(train_X, train_y)

# Print accuracy, time and number of dimensions
#print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. The ngram representation had %i features." % (time.time() - start_time, clf.score(test_X, test_y), train_X.shape[1]))

#################################################
#<script.py> output:
#    The program took 2.391 seconds to complete. The accuracy on the test set is 0.77. The ngram representation had 178240 features.
#################################################
#The program took around 0.2 seconds in the case of the unigram
#model and about 2.4 seconds, roughly 14 times longer, for the
#higher order n-gram model. The unigram model had over 12,000
#features whereas the n-gram model for up to n=3 had over 178,000!
#Despite taking more computation time and generating far more
#features, the classifier performs only marginally better in the
#latter case, producing an accuracy of 77% in comparison to the
#75% for the unigram model.

**Building tf-idf document vectors**
___
- Applications
    - automatically detect stopwords
    - search
    - recommender systems
    - better performance in predictive modeling for some cases
- term frequency-inverse document frequency
    - proportional to term frequency
    - inverse function of the number of documents in which it occurs
    ![_images/19.4.PNG](_images/19.4.PNG)
___
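
The two properties listed above can be made concrete with one common textbook form of the weight, tf × log(N / df), where N is the number of documents and df is the number of documents containing the term. The sketch below assumes that form and uses three made-up, already tokenized documents. TfidfVectorizer's defaults differ: it uses a smoothed idf, ln((1 + N) / (1 + df)) + 1, and L2-normalizes each vector, so its numbers will not match these.

```python
import math
from collections import Counter

# Made-up, already tokenized documents
docs = [['lion', 'king', 'jungle'],
        ['lion', 'lifespan', 'decade'],
        ['lion', 'endangered', 'species', 'species']]

N = len(docs)
# Document frequency: number of documents in which each term occurs
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Weight each term by its count times log(N / document frequency)."""
    tf = Counter(doc)
    return {term: tf[term] * math.log(N / df[term]) for term in tf}

weights = [tfidf(doc) for doc in docs]
print(weights[0]['lion'])     # 0.0, because 'lion' occurs in every document
print(weights[2]['species'])  # 2 * log(3), because it is frequent here and rare elsewhere
```

A term that appears in every document gets zero weight, which is why tf-idf can act as an automatic stopword detector.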

In [None]:
#tf-idf vectors for TED talks

#In this exercise, you have been given a corpus ted which contains
#the transcripts of 500 TED Talks. Your task is to generate the
#tf-idf vectors for these talks.

#In a later lesson, we will use these vectors to generate
#recommendations of similar talks based on the transcript.

# Import TfidfVectorizer
#from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
#vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
#tfidf_matrix = vectorizer.fit_transform(ted)

# Print the shape of tfidf_matrix
#print(tfidf_matrix.shape)

#################################################
#<script.py> output:
#    (500, 29158)
#################################################
#You now know how to generate tf-idf vectors for a given corpus
#of text. You can use these vectors to perform predictive
#modeling just like we did with CountVectorizer. In the next few
#lessons, we will see another extremely useful application of the
#vectorized form of documents: generating recommendations.

**Cosine similarity**
___
![_images/19.5.PNG](_images/19.5.PNG)
![_images/19.6.PNG](_images/19.6.PNG)
![_images/19.7.PNG](_images/19.7.PNG)
![_images/19.8.PNG](_images/19.8.PNG)
- Cosine Score: points to remember
    - value between -1 and 1
    - in NLP, value between 0 and 1, since word counts and tf-idf weights are non-negative
        - 0 -> no similarity, 1 -> identical (same direction)
    - robust to document length
___
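
The cosine score is the dot product of two vectors divided by the product of their magnitudes. The plain-Python sketch below is a minimal illustration (the function name `cosine` is an assumption); in practice scikit-learn's cosine_similarity does this for whole matrices.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

A = [1, 3]
B = [-2, 2]

score = cosine(A, B)
print(round(score, 4))  # 0.4472

# Scaling a vector changes its length but not its direction
same_direction = cosine(A, [10, 30])
print(round(same_direction, 4))  # 1.0
```

The second call illustrates robustness to document length: a document repeated ten times has ten times the counts but the same cosine score.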

In [11]:
#Computing dot product

#In this exercise, we will learn to compute the dot product
#between two vectors, A = (1, 3) and B = (-2, 2), using the numpy
#library. More specifically, we will use the np.dot() function to
#compute the dot product of two numpy arrays.

import numpy as np
# Initialize numpy vectors
A = np.array([1,3])
B = np.array([-2, 2])

# Compute dot product
dot_prod = np.dot(A, B)

# Print dot product
print(dot_prod)

#################################################
#The dot product of the two vectors is 1 * -2 + 3 * 2 = 4, which
#is indeed the output produced. We will not be using np.dot()
#too much in this course but it can prove to be a helpful
#function while computing dot products between two standalone
#vectors.

4


In [12]:
#Cosine similarity matrix of a corpus

#In this exercise, you have been given a corpus, which is a list
#containing five sentences. The corpus is printed in the console.
#You have to compute the cosine similarity matrix which contains
#the pairwise cosine similarity score for every pair of sentences
#(vectorized using tf-idf).

#Remember, the value corresponding to the ith row and jth column
#of a similarity matrix denotes the similarity score for the ith
#and jth vector.

corpus = ['The sun is the largest celestial body in the solar system',
 'The solar system consists of the sun and eight revolving planets',
 'Ra was the Egyptian Sun God',
 'The Pyramids were the pinnacle of Egyptian architecture',
 'The quick brown fox jumps over the lazy dog']

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix,tfidf_matrix)
print(cosine_sim)

#################################################
#As you will see in a subsequent lesson, computing the cosine
#similarity matrix lies at the heart of many practical systems
#such as recommenders. From our similarity matrix, we see that
#the first and the second sentence are the most similar. Also
#the fifth sentence has, on average, the lowest pairwise cosine
#scores. This is intuitive as it contains entities that are not
#present in the other sentences.

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
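
The two observations in the note can be checked programmatically. The sketch below copies the printed scores into a plain list of lists (the variable names `best_pair`, `avg_scores` and `least_similar` are illustrative) and looks up the closest pair and the sentence with the lowest average score:

```python
# Scores copied from the matrix printed above (rows/columns follow corpus order).
cosine_sim = [
    [1.0, 0.36413198, 0.18314713, 0.18435251, 0.16336438],
    [0.36413198, 1.0, 0.15054075, 0.21704584, 0.11203887],
    [0.18314713, 0.15054075, 1.0, 0.21318602, 0.07763512],
    [0.18435251, 0.21704584, 0.21318602, 1.0, 0.12960089],
    [0.16336438, 0.11203887, 0.07763512, 0.12960089, 1.0],
]
n = len(cosine_sim)

# Most similar pair of distinct sentences (upper triangle only; the matrix is symmetric).
best_pair = max(
    ((i, j) for i in range(n) for j in range(i + 1, n)),
    key=lambda p: cosine_sim[p[0]][p[1]],
)

# Average score of each sentence against the others (diagonal excluded).
avg_scores = [(sum(row) - row[i]) / (n - 1) for i, row in enumerate(cosine_sim)]
least_similar = min(range(n), key=lambda i: avg_scores[i])

print(best_pair)      # zero-based indices of the closest pair
print(least_similar)  # zero-based index of the sentence with the lowest average
```

The indices are zero-based, so index 0 is the first sentence and index 4 is the fifth.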


**Building a plot line based recommender**
___
- Steps
    - text preprocessing
    - generate tf-idf vectors
    - generate cosine similarity matrix
- the recommender function
    - take a movie title, cosine similarity matrix and indices series as arguments
    - extract pairwise cosine similarity scores for the movie
    - sort the scores in descending order
    - output titles corresponding to the highest scores
        - ignore the highest similarity score of 1 (same title)
- the linear_kernel function
    - the magnitude of a tf-idf vector is 1 (with TfidfVectorizer's default L2 normalization)
        - therefore, cosine score between two tf-idf vectors is their dot product
        - can significantly improve computation time
        - use linear_kernel instead of cosine_similarity
___
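
The reasoning behind the linear_kernel shortcut can be sketched in plain Python. The term weights below are made up for illustration, and `l2_normalize` is a hypothetical helper that stands in for the L2 normalization that `TfidfVectorizer` applies by default (`norm='l2'`). Once both vectors have length 1, the denominator of the cosine formula is 1 and only the dot product remains:

```python
import math

def l2_normalize(v):
    """Scale a vector so that its Euclidean length is 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Full cosine formula: dot product divided by both magnitudes."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical raw term weights for two documents (made up for illustration).
raw_a = [0.0, 2.0, 1.0, 3.0]
raw_b = [1.0, 0.0, 2.0, 2.0]

unit_a = l2_normalize(raw_a)
unit_b = l2_normalize(raw_b)

print(cosine(raw_a, raw_b))  # full formula on the raw vectors
print(dot(unit_a, unit_b))   # plain dot product on the unit vectors
```

Both lines print the same value up to floating-point rounding; skipping the two magnitude computations is where the time saving comes from.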

In [None]:
# Record start time
#start = time.time()

# Compute cosine similarity matrix
#cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
#print(cosine_sim)

# Print time taken
#print("Time taken: %s seconds" %(time.time() - start))

#################################################
#<script.py> output:
#    [[1.         0.         0.         ... 0.         0.         0.        ]
#     [0.         1.         0.         ... 0.         0.         0.        ]
#     [0.         0.         1.         ... 0.         0.01418221 0.        ]
#     ...
#     [0.         0.         0.         ... 1.         0.01589009 0.        ]
#     [0.         0.         0.01418221 ... 0.01589009 1.         0.        ]
#     [0.         0.         0.         ... 0.         0.         1.        ]]
#    Time taken: 0.32429075241088867 seconds
#################################################

In [None]:
# Record start time
#start = time.time()

# Compute cosine similarity matrix
#cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
#print(cosine_sim)

# Print time taken
#print("Time taken: %s seconds" %(time.time() - start))

#################################################
#<script.py> output:
#    [[1.         0.         0.         ... 0.         0.         0.        ]
#     [0.         1.         0.         ... 0.         0.         0.        ]
#     [0.         0.         1.         ... 0.         0.01418221 0.        ]
#     ...
#     [0.         0.         0.         ... 1.         0.01589009 0.        ]
#     [0.         0.         0.01418221 ... 0.01589009 1.         0.        ]
#     [0.         0.         0.         ... 0.         0.         1.        ]]
#    Time taken: 0.3089025020599365 seconds
#################################################
#Notice how both linear_kernel and cosine_similarity produced
#the same result. However, linear_kernel took slightly less
#time to execute. When you're working with a very large
#amount of data and your vectors are in the tf-idf representation,
#it is good practice to default to linear_kernel to improve
#performance. (NOTE: If you see linear_kernel taking more
#time, it's because the dataset we're dealing with is extremely
#small and Python's time module is incapable of capturing such
#minute time differences accurately.)
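
For differences this small, a single `time.time()` measurement is noisy. One common alternative is the standard library's `timeit` module, which runs a call many times and lets you keep the fastest trial. The sketch below uses a hypothetical stand-in function `work()`; in the exercise it would be replaced by the `cosine_similarity` or `linear_kernel` call being compared:

```python
import timeit

def work():
    # Stand-in workload (hypothetical); replace with the call being measured.
    return sum(i * i for i in range(1000))

# Run the call 200 times per trial, repeat 5 trials, and keep the fastest trial.
trials = timeit.repeat(work, number=200, repeat=5)
best_per_call = min(trials) / 200

print("Best time per call: %.9f seconds" % best_per_call)
```

Taking the minimum rather than the mean reduces the influence of other processes that happen to run during a trial.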

In [None]:
#Plot recommendation engine

#In this exercise, we will build a recommendation engine that
#suggests movies based on similarity of plot lines. You have
#been given a get_recommendations() function that takes in the
#title of a movie, a similarity matrix and an indices series as
#its arguments and outputs a list of most similar movies. indices
#has already been provided to you.

#You have also been given a movie_plots Series that contains the
#plot lines of several movies. Your task is to generate a cosine
#similarity matrix for the tf-idf vectors of these plots.

#Afterwards, we will check the potency of our engine by
#generating recommendations for one of my favorite movies, The
#Dark Knight Rises.

# Initialize the TfidfVectorizer
#tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
#tfidf_matrix = tfidf.fit_transform(movie_plots)

# Generate the cosine similarity matrix
#cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Generate recommendations
#print(get_recommendations('The Dark Knight Rises', cosine_sim, indices))

#################################################
#<script.py> output:
#    1                              Batman Forever
#    2                                      Batman
#    3                              Batman Returns
#    8                  Batman: Under the Red Hood
#    9                            Batman: Year One
#    10    Batman: The Dark Knight Returns, Part 1
#    11    Batman: The Dark Knight Returns, Part 2
#    5                Batman: Mask of the Phantasm
#    7                               Batman Begins
#    4                              Batman & Robin
#    Name: title, dtype: object
#################################################
#You've just built your very first recommendation system. Notice
#how the recommender correctly identifies 'The Dark Knight Rises'
#as a Batman movie and recommends other Batman movies as a result.
#This system is, of course, very primitive and there are a host
#of ways in which it could be improved. One method would be to
#look at the cast, crew and genre in addition to the plot to
#generate recommendations. We will not be covering this in this
#course but you have all the tools necessary to accomplish this.
#Do give it a try!

In [None]:
#The recommender function

#In this exercise, we will build a recommender function
#get_recommendations(), as discussed in the lesson and the
#previous exercise. As we know, it takes in a title, a cosine
#similarity matrix, and a movie title and index mapping as
#arguments and outputs a list of 10 titles most similar to the
#original title (excluding the title itself).

#You have been given a dataset metadata that consists of the
#movie titles and taglines. The head of this dataset has been
#printed to console.

#################################################
#               title                                            tagline
#938  Cinema Paradiso  A celebration of youth, friendship, and the ev...
#630         Spy Hard  All the action. All the women. Half the intell...
#682        Stonewall                    The fight for the right to love
#514           Killer                    You only hurt the one you love.
#365    Jason's Lyric                                   Love is courage.
#################################################

# Generate mapping between titles and index
#indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

#def get_recommendations(title, cosine_sim, indices):
    # Get index of movie that matches title
#    idx = indices[title]
    # Sort the movies based on the similarity scores
#    sim_scores = list(enumerate(cosine_sim[idx]))
#    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 10 most similar movies
#    sim_scores = sim_scores[1:11]
    # Get the movie indices
#    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
#    return metadata['title'].iloc[movie_indices]

#################################################
#With this recommender function in our toolkit, we are now in a
#very good place to build the rest of the components of our
#recommendation engine.
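
The slice `sim_scores[1:11]` relies on the queried movie sorting first with a score of 1. As a minimal sketch of the same ranking steps on plain Python lists, the version below uses hypothetical titles and a hand-made similarity matrix, and removes the queried movie by index instead of by position. `get_recommendations_sketch` and its `top_n` parameter are illustrative names, not the course's implementation:

```python
# Hypothetical titles and a hand-made symmetric similarity matrix (illustration only).
titles = ["A", "B", "C", "D"]
cosine_sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.5],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.5, 0.4, 1.0],
]
indices = {title: i for i, title in enumerate(titles)}

def get_recommendations_sketch(title, cosine_sim, indices, top_n=10):
    """Return up to top_n titles ranked by similarity to `title`, excluding itself."""
    idx = indices[title]
    # Pair every movie index with its score against the queried movie.
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Drop the queried movie by index rather than assuming it sorts first.
    sim_scores = [pair for pair in sim_scores if pair[0] != idx]
    # Highest scores first.
    sim_scores.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in sim_scores[:top_n]]

print(get_recommendations_sketch("A", cosine_sim, indices, top_n=2))
```

Filtering by index keeps the result correct even when another entry ties with the queried movie at a score of 1, for example a duplicate plot line.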

In [None]:
#TED talk recommender

#In this exercise, we will build a recommendation system that
#suggests TED Talks based on their transcripts. You have been
#given a get_recommendations() function that takes in the title
#of a talk, a similarity matrix and an indices series as its
#arguments, and outputs a list of most similar talks. indices
#has already been provided to you.

#You have also been given a transcripts series that contains the
#transcripts of around 500 TED talks. Your task is to generate a
#cosine similarity matrix for the tf-idf vectors of the talk
#transcripts.

#Afterwards, we will generate recommendations for a talk
#titled '5 ways to kill your dreams' by Brazilian entrepreneur
#Bel Pesce.

# Initialize the TfidfVectorizer
#tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
#tfidf_matrix = tfidf.fit_transform(transcripts)

# Generate the cosine similarity matrix
#cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Generate recommendations
#print(get_recommendations('5 ways to kill your dreams', cosine_sim, indices))

#################################################
#<script.py> output:
#    453             Success is a continuous journey
#    157                        Why we do what we do
#    494                   How to find work you love
#    149          My journey into movies that matter
#    447                        One Laptop per Child
#    230             How to get your ideas to spread
#    497         Plug into your hard-wired happiness
#    495    Why you will fail to have a great career
#    179             Be suspicious of simple stories
#    53                          To upgrade is human
#    Name: title, dtype: object
#################################################
#You have successfully built a TED talk recommender. This
#recommender works surprisingly well despite being trained only
#on a small subset of TED talks. In fact, three of the talks
#recommended by our system are also recommended by the official
#TED website as talks to watch next after '5 ways to kill your
#dreams'!

**Beyond n-grams: word embeddings**
___
![_images/19.9.PNG](_images/19.9.PNG)
- Word embeddings
    - mapping words into an n-dimensional vector space
    - produced using deep learning and huge amounts of data
    - discern how similar two words are to each other
    - used to detect synonyms and antonyms
    - captures complex relationships
        - King - Queen -> Man - Woman
        - France - Paris -> Russia - Moscow
    - dependent on spaCy model; independent of data set you use
___
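
The relationship King - Queen -> Man - Woman can be illustrated with vector arithmetic. The sketch below uses hand-made two-dimensional vectors that are purely hypothetical. Real embeddings are learned from data and have many more dimensions, so the match is only approximate in practice:

```python
import math

# Hand-made 2-D "embeddings" (hypothetical; real models learn many more dimensions).
# Axis 0 loosely means "royalty", axis 1 loosely means "gender".
vectors = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8,  1.0],
    "apple":  [-1.0, 0.0],
}

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))

# The two difference vectors are the same: King - Queen -> Man - Woman.
print(sub(vectors["king"], vectors["queen"]))
print(sub(vectors["man"], vectors["woman"]))

# Analogy by arithmetic: king - man + woman should land nearest to queen.
target = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(vectors[w], target))
print(best)
```

The words used to build the target are excluded from the candidates, which is the usual convention when evaluating analogies.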

In [1]:
#Generating word vectors

#In this exercise, we will generate the pairwise similarity
#scores of all the words in a sentence. The sentence is available
#as sent and has been printed to the console for your convenience.

sent = 'I like apples and oranges'

import spacy

# Load the en_core_web_lg model
nlp = spacy.load('en_core_web_lg')

# Create the doc object
doc = nlp(sent)

# Compute pairwise similarity scores
for token1 in doc:
  for token2 in doc:
    print(token1.text, token2.text, token1.similarity(token2))

#################################################
#Notice how the words 'apples' and 'oranges' have the highest
#pairwise similarity score. This is expected as they are both
#fruits and are more related to each other than any other pair
#of words.

I I 1.0
I like 0.55549127
I apples 0.20442726
I and 0.31607854
I oranges 0.1882408
like I 0.55549127
like like 1.0
like apples 0.3298714
like and 0.5267484
like oranges 0.27717474
apples I 0.20442726
apples like 0.3298714
apples apples 1.0
apples and 0.24097733
apples oranges 0.77809423
and I 0.31607854
and like 0.5267484
and apples 0.24097733
and and 1.0
and oranges 0.19245945
oranges I 0.1882408
oranges like 0.27717474
oranges apples 0.77809423
oranges and 0.19245945
oranges oranges 1.0
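
The nested loop prints every ordered pair, so each score appears twice and every word is also compared with itself. A minimal sketch of picking out the best distinct pair, using `itertools.combinations` and the scores copied from the output above (the `scores` dictionary is an illustrative stand-in for calling `similarity()` again):

```python
from itertools import combinations

words = ["I", "like", "apples", "and", "oranges"]

# Scores copied from the output above; each unordered pair is listed once.
scores = {
    ("I", "like"): 0.55549127,
    ("I", "apples"): 0.20442726,
    ("I", "and"): 0.31607854,
    ("I", "oranges"): 0.1882408,
    ("like", "apples"): 0.3298714,
    ("like", "and"): 0.5267484,
    ("like", "oranges"): 0.27717474,
    ("apples", "and"): 0.24097733,
    ("apples", "oranges"): 0.77809423,
    ("and", "oranges"): 0.19245945,
}

# combinations() yields each unordered pair once and never pairs a word with itself.
pairs = list(combinations(words, 2))
best = max(pairs, key=lambda p: scores[p])

print(len(pairs))  # number of unique pairs, versus 25 printed lines
print(best)
```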


In [2]:
#Computing similarity of Pink Floyd songs

#In this final exercise, you have been given lyrics of three
#songs by the British band Pink Floyd, namely 'High Hopes',
#'Hey You' and 'Mother'. The lyrics to these songs are available
#as hopes, hey and mother respectively.

#Your task is to compute the pairwise similarity between mother
#and hopes, and mother and hey.

mother = "\nMother do you think they'll drop the bomb?\nMother do you think they'll like this song?\nMother do you think they'll try to break my balls?\nOoh, ah\nMother should I build the wall?\nMother should I run for President?\nMother should I trust the government?\nMother will they put me in the firing mine?\nOoh ah,\nIs it just a waste of time?\nHush now baby, baby, don't you cry.\nMama's gonna make all your nightmares come true.\nMama's gonna put all her fears into you.\nMama's gonna keep you right here under her wing.\nShe won't let you fly, but she might let you sing.\nMama's gonna keep baby cozy and warm.\nOoh baby, ooh baby, ooh baby,\nOf course mama's gonna help build the wall.\nMother do you think she's good enough, for me?\nMother do you think she's dangerous, to me?\nMother will she tear your little boy apart?\nOoh ah,\nMother will she break my heart?\nHush now baby, baby don't you cry.\nMama's gonna check out all your girlfriends for you.\nMama won't let anyone dirty get through.\nMama's gonna wait up until you get in.\nMama will always find out where you've been.\nMama's gonna keep baby healthy and clean.\nOoh baby, ooh baby, ooh baby,\nYou'll always be baby to me.\nMother, did it need to be so high?\n"
hopes = "\nBeyond the horizon of the place we lived when we were young\nIn a world of magnets and miracles\nOur thoughts strayed constantly and without boundary\nThe ringing of the division bell had begun\nAlong the Long Road and on down the Causeway\nDo they still meet there by the Cut\nThere was a ragged band that followed in our footsteps\nRunning before times took our dreams away\nLeaving the myriad small creatures trying to tie us to the ground\nTo a life consumed by slow decay\nThe grass was greener\nThe light was brighter\nWhen friends surrounded\nThe nights of wonder\nLooking beyond the embers of bridges glowing behind us\nTo a glimpse of how green it was on the other side\nSteps taken forwards but sleepwalking back again\nDragged by the force of some in a tide\nAt a higher altitude with flag unfurled\nWe reached the dizzy heights of that dreamed of world\nEncumbered forever by desire and ambition\nThere's a hunger still unsatisfied\nOur weary eyes still stray to the horizon\nThough down this road we've been so many times\nThe grass was greener\nThe light was brighter\nThe taste was sweeter\nThe nights of wonder\nWith friends surrounded\nThe dawn mist glowing\nThe water flowing\nThe endless river\nForever and ever\n"
hey = "\nHey you, out there in the cold\nGetting lonely, getting old\nCan you feel me?\nHey you, standing in the aisles\nWith itchy feet and fading smiles\nCan you feel me?\nHey you, don't help them to bury the light\nDon't give in without a fight\nHey you out there on your own\nSitting naked by the phone\nWould you touch me?\nHey you with you ear against the wall\nWaiting for someone to call out\nWould you touch me?\nHey you, would you help me to carry the stone?\nOpen your heart, I'm coming home\nBut it was only fantasy\nThe wall was too high\nAs you can see\nNo matter how he tried\nHe could not break free\nAnd the worms ate into his brain\nHey you, out there on the road\nAlways doing what you're told\nCan you help me?\nHey you, out there beyond the wall\nBreaking bottles in the hall\nCan you help me?\nHey you, don't tell me there's no hope at all\nTogether we stand, divided we fall\n"

import spacy

# Load the en_core_web_lg model
nlp = spacy.load('en_core_web_lg')

# Create Doc objects
mother_doc = nlp(mother)
hopes_doc = nlp(hopes)
hey_doc = nlp(hey)

# Print similarity between mother and hopes
print(mother_doc.similarity(hopes_doc))

# Print similarity between mother and hey
print(mother_doc.similarity(hey_doc))

#################################################
#Notice that 'Mother' and 'Hey You' have a similarity score of
#about 0.96 whereas 'Mother' and 'High Hopes' have a score of
#about 0.87. This is probably because 'Mother' and 'Hey You' were
#both songs from the same album 'The Wall' and were penned by
#Roger Waters. On the other hand, 'High Hopes' was a part of the
#album 'The Division Bell' with lyrics by David Gilmour and his
#wife, Polly Samson. Treat yourself by listening to these songs.
#They're some of the best!

0.8653563052842299
0.9595266707315476
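
By default, spaCy represents a Doc by the average of its token vectors and scores similarity as the cosine between those averages. The sketch below imitates that idea in plain Python with hypothetical three-dimensional token vectors; `average` and `cosine` are illustrative helpers, not spaCy functions:

```python
import math

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 3-D token vectors for two short "documents" (made up for illustration).
doc_a_tokens = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
doc_b_tokens = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

# One vector per document: the mean of its token vectors.
doc_a_vec = average(doc_a_tokens)
doc_b_vec = average(doc_b_tokens)

print(cosine(doc_a_vec, doc_b_vec))
```

Because averaging smooths out individual words, long texts that share many common words tend to receive high scores, so the gap between two scores is often more informative than either absolute value.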


**Congratulations!**
___
- Review
    - basic features (characters, words, mentions, etc.)
    - readability scores
    - tokenization and lemmatization
    - text cleaning
    - part-of-speech tagging & named entity recognition
    - n-gram modelling
    - tf-idf
    - cosine similarity
    - word embeddings
- Further DataCamp courses
    - Advanced NLP with spaCy
    - Deep Learning in Python

___
