
## The fifth In-class-exercise (9/30/2020, 20 points in total)

In exercise-03, I asked you to collect 500 texts based on your own information needs (if you didn't collect the textual data, you should collect it again for this exercise). Now we need to think about how to represent the textual data for text classification. In this exercise, you are required to select 10 types of features (10 types, which will certainly give more than 10 individual features) from the following feature list, then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjectives, adverbs, auxiliaries, punctuation marks, complementizers, coordinating conjunctions, subordinating conjunctions, determiners, interjections, nouns, possessors, prepositions, pronouns, quantifiers, verbs, and others. (Select some of them if you use POS-tag features.)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density of determiners/quantifiers (density can be calculated as the ratio of the following function words to content words)
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features you have in mind (e.g., pre-defined patterns)
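
Several of the linguistic features above reduce to simple arithmetic once a constituency parse is available. The sketch below is only an illustration, assuming a parse is represented as nested tuples of the form `(label, child, ...)` with plain strings as tokens. That representation, the helper names `leaves` and `phrase_lengths`, and the hand-counted modifier numbers are assumptions made for this example; a real pipeline would obtain the trees from a parser.

```python
def leaves(node):
    """Return the tokens covered by a node of the toy tree."""
    if isinstance(node, str):
        return [node]
    tokens = []
    for child in node[1:]:
        tokens.extend(leaves(child))
    return tokens

def phrase_lengths(node, label):
    """Collect the token length of every constituent carrying the given label."""
    if isinstance(node, str):
        return []
    lengths = [len(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        lengths.extend(phrase_lengths(child, label))
    return lengths

# Toy parse (assumption): [S [NP the dark film] [VP unsettles [NP viewers]]]
toy_tree = ("S",
            ("NP", "the", "dark", "film"),
            ("VP", "unsettles", ("NP", "viewers")))

np_lengths = phrase_lengths(toy_tree, "NP")
max_np = max(np_lengths)                      # maximal NP length
avg_np = sum(np_lengths) / len(np_lengths)    # average NP length
sentence_length = len(leaves(toy_tree))       # sentence length in tokens

# Index features are plain differences of two counts.
premodifiers, postmodifiers = 1, 0            # hand-counted for the toy sentence (assumption)
modification_index = premodifiers - postmodifiers
```

The branching index and the weight indices follow the same pattern: count (or sum the token lengths of) the two node types and subtract.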

In [None]:
!apt-get update
!apt install chromium-chromedriver
!pip install selenium
!pip install textblob

In [2]:
# Web-scrape and collect review data using Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import pandas as pd
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# open a headless browser, go to the review page, and get results
driver = webdriver.Chrome(options=options)
driver.get("https://www.imdb.com/title/tt7286456/reviews?ref_=tt_urv")
# click the "load more" button four times, pausing so the new reviews can render
for i in range(0,4):
  driver.find_element(By.ID, "load-more-trigger").click()
  time.sleep(5)
output = list()
reviews = driver.find_elements(By.CLASS_NAME, "content")
for review in reviews:
  output.append(review.find_element(By.CSS_SELECTOR, ".text.show-more__control").text)
# write the raw reviews to disk once, after the loop has finished
pd.DataFrame(output,columns = ["review"]).to_csv("User_reviews")
# create a data frame of all the reviews for data cleaning
result_df= pd.DataFrame(output,columns = ["User_Review"])
#result_df

In [8]:
#Data Cleaning steps:
#1.Remove noise, such as special characters and punctuation.
#2.Remove numbers
#3.Remove stopwords by using the stopwords list
#4.Lowercase all texts
#5.Stemming
#6.Lemmatization.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
stemmer=PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words=set(stopwords.words("english"))
final_stemmed_review=[]
final_lemmatized_review=[]
# cleaning the text data
result_df['User_Review'] = result_df['User_Review'].str.replace(r"\W", " ", regex=True).str.strip() # 1. Remove special characters and punctuation
result_df['User_Review'] = result_df['User_Review'].str.replace(r'\d+', "", regex=True) # 2. Remove numbers
for a in result_df['User_Review']:
    review_filter_stemming=[]
    review_filter_Lemmatization=[]
    for b in word_tokenize(a):
        # 3. Remove stop words. The check runs before lowercasing, so it is
        #    case-sensitive and capitalized stop words such as "The" are kept.
        if b not in stop_words:
            b=b.lower() # 4. Convert text to lower case
            review_filter_stemming.append(stemmer.stem(b)) # 5. Stemming
            review_filter_Lemmatization.append(lemmatizer.lemmatize(b)) # 6. Lemmatization
    final_stemmed_review.append(' '.join(review_filter_stemming))
    final_lemmatized_review.append(' '.join(review_filter_Lemmatization))
result_df['Review after Stemming']=final_stemmed_review
result_df['Review after Lemmatization']=final_lemmatized_review
# word count of the lemmatized text; an empty review still counts as 1 because "".split(" ") returns [""]
result_df['word_count'] = result_df["Review after Lemmatization"].apply(lambda x : len(x.split(" ")))
#result_df


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
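
The cleaning steps above can also be wrapped in a single reusable function. The following is a minimal sketch in plain Python, assuming a small hand-written stop-word set (`TOY_STOP_WORDS`, made up for this example) in place of the NLTK list, and leaving out stemming and lemmatization. It lowercases before the stop-word check, so capitalized stop words such as "This" are removed as well.

```python
import re

TOY_STOP_WORDS = {"the", "a", "is", "of", "this", "and"}  # stand-in list (assumption)

def clean_text(text, stop_words=TOY_STOP_WORDS):
    """Return the cleaned, lowercased tokens of one review."""
    text = re.sub(r"\W", " ", text)   # 1. drop special characters and punctuation
    text = re.sub(r"\d+", "", text)   # 2. drop numbers
    tokens = text.lower().split()     # 4. lowercase, then split on whitespace
    return [t for t in tokens if t not in stop_words]  # 3. drop stop words

cleaned = clean_text("This is the BEST film of 2019, truly!")
```

Because `split()` is called without an argument, an empty review yields an empty token list rather than a single empty string.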


In [9]:
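# Function definitions that compute tf, idf and tf-idf style values for the first word of each text.
# Both values are computed within a single review, not across the whole corpus.
import math

# computeTF: occurrences of the first word divided by the number of distinct words
def computeTF(sentence):
  words = sentence.split(" ")
  value = len(set(words))
  return words.count(words[0])/value

# computeIDF: log10 of (total number of words / occurrences of the first word)
def computeIDF(sentence):
  words = sentence.lower().split(" ")
  res = words.count(words[0])
  return math.log(len(words)/res, 10)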

In [10]:
result_df["tf"] = result_df["Review after Lemmatization"].apply(lambda x : round(computeTF(x),4))
result_df["idf"] = result_df["Review after Lemmatization"].apply(lambda x : round(computeIDF(x),4))
result_df["tf-idf"] = result_df["tf"]*result_df["idf"]
#result_df
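
For comparison, the usual corpus-level definition weights a term by how often it occurs in one document and by how few documents contain it. The sketch below uses plain Python and assumes one common variant, tf = count / document length and idf = log10(N / document frequency); other variants (smoothing, natural logarithm) exist. The function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

toy_docs = ["joker great film", "great acting", "dark film"]
scores = tf_idf(toy_docs)
```

In the toy corpus, "joker" appears in one document and "great" in two, so "joker" receives the higher weight within the first document. Each resulting dictionary can be expanded into one column per term to obtain a feature matrix.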

In [12]:
# syntax and structure analysis
#3.1 Parts of Speech (POS) Tagging:
import nltk
from collections import Counter
nltk.download('averaged_perceptron_tagger')
# POS-tag every lemmatized review; tags[i] is a list of (word, tag) pairs
tags=[]
for val in result_df['Review after Lemmatization']:
    word_tokens=word_tokenize(val)
    POS_tags=nltk.pos_tag(word_tokens)
    tags.append(POS_tags)

def get_pos(value,tags):
  # per-review count of tokens tagged exactly `value` (None when the tag never occurs)
  pos_counts=[]
  for j in tags:
    counts = Counter(tag for word,tag in j)
    pos_counts.append(counts.get(value))
  return pos_counts


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
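
Penn Treebank tags come in families (for example `NN`, `NNS`, `NNP`, `NNPS` for nouns and `VB`, `VBD`, `VBG`, `VBN`, `VBP`, `VBZ` for verbs), so an exact match on `"NN"` or `"VB"` counts only one member of each family. A sketch of prefix-based counting, assuming input in the `(word, tag)` format that `nltk.pos_tag` returns; the helper name `count_by_prefix` and the hand-tagged sample are made up for illustration.

```python
from collections import Counter

def count_by_prefix(tagged_tokens, prefix):
    """Count tokens whose POS tag starts with the given prefix."""
    counts = Counter(tag for _, tag in tagged_tokens)
    return sum(n for tag, n in counts.items() if tag.startswith(prefix))

# Hand-tagged sample (assumption), in the same shape as nltk.pos_tag output
sample = [("movies", "NNS"), ("are", "VBP"), ("anticipated", "VBN"),
          ("joker", "NN"), ("truly", "RB"), ("he", "PRP")]

noun_total = count_by_prefix(sample, "NN")      # NN and NNS
verb_total = count_by_prefix(sample, "VB")      # VBP and VBN
pronoun_total = count_by_prefix(sample, "PRP")  # PRP and PRP$
```

Returning an integer for every review (0 when no tag matches) also avoids the missing values that a plain dictionary lookup produces.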


In [13]:
import numpy as np
# Exact-match tag counts per review; e.g. "NN" counts singular common nouns only
# and "VB" counts base-form verbs only. Missing tags become NaN, filled with 0 below.
result_df["Number of nouns"] = pd.DataFrame(get_pos("NN",tags))
result_df["Number of verbs"] = pd.DataFrame(get_pos("VB",tags))
result_df["Number of predeterminers"] = pd.DataFrame(get_pos("PDT",tags))
result_df["Number of adverbs"] = pd.DataFrame(get_pos("RB",tags))
result_df["Number of cardinal digits"] = pd.DataFrame(get_pos("CD", tags))
result_df["Number of prepositions"] = pd.DataFrame(get_pos("IN",tags))
result_df["Number of Determiners"] = pd.DataFrame(get_pos("DT",tags))
result_df["Number of adjectives"] = pd.DataFrame(get_pos("JJ",tags))
result_df["Number of conjunctions"] = pd.DataFrame(get_pos("CC",tags))
# "PP" is not a Penn Treebank tag (pronouns are tagged PRP / PRP$), so this count is 0 for every review
result_df["Number of pronouns"] = pd.DataFrame(get_pos("PP",tags))
# punctuation was stripped during cleaning, so ":" never occurs either
result_df["Number of Punctuations"] = pd.DataFrame(get_pos(":",tags))
result_df = result_df.fillna(0)
pronouns = pd.DataFrame(get_pos("PP",tags))
prepositions = pd.DataFrame(get_pos("IN",tags))
punctuations = pd.DataFrame(get_pos(":",tags))
conjunctions = pd.DataFrame(get_pos("CC",tags))
# Each "density" divides a count by the same raw count, so it is 1.0 where the tag
# occurs and 0 (after fillna) where it does not; it is a presence flag, not a ratio
# of function words to content words.
result_df["Density of pronouns"] = result_df["Number of pronouns"]/pronouns[0]
result_df["Density of prepositions"] = result_df["Number of prepositions"]/prepositions[0]
result_df["Density of punctuations"] = result_df["Number of Punctuations"]/punctuations[0]
result_df["Density of conjunctions"] = result_df["Number of conjunctions"]/conjunctions[0]
result_df = result_df.fillna(0)
#result_df
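
The exercise defines density as the ratio of function words to content words. A sketch of that definition, assuming nouns, verbs, adjectives, and adverbs (tags starting with `NN`, `VB`, `JJ`, `RB`) count as content words; that grouping, the helper name `density`, and the hand-tagged sample are assumptions for illustration.

```python
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")  # assumed content-word tag families

def density(tagged_tokens, function_prefix):
    """Ratio of tokens in one function-word class to all content words."""
    function_count = sum(tag.startswith(function_prefix) for _, tag in tagged_tokens)
    content_count = sum(tag.startswith(CONTENT_PREFIXES) for _, tag in tagged_tokens)
    # guard against reviews with no content words
    return function_count / content_count if content_count else 0.0

# Hand-tagged sample (assumption), in the same shape as nltk.pos_tag output
sample = [("he", "PRP"), ("saw", "VBD"), ("the", "DT"), ("film", "NN"),
          ("in", "IN"), ("a", "DT"), ("dark", "JJ"), ("theater", "NN")]

pronoun_density = density(sample, "PRP")
determiner_density = density(sample, "DT")
preposition_density = density(sample, "IN")
```

Such densities are only meaningful when the function words are still present, that is, when tagging is done on text from which stop words and punctuation have not been removed.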

In [18]:
final_df = result_df[['User_Review','word_count','tf','idf','tf-idf','Number of nouns','Number of adverbs','Number of cardinal digits','Number of verbs','Number of Determiners','Number of adjectives',"Density of pronouns","Density of prepositions","Density of punctuations","Density of conjunctions"]]
final_df

Unnamed: 0,User_Review,word_count,tf,idf,tf-idf,Number of nouns,Number of adverbs,Number of cardinal digits,Number of verbs,Number of Determiners,Number of adjectives,Density of pronouns,Density of prepositions,Density of punctuations,Density of conjunctions
0,Every once in a while a movie comes that trul...,50,0.0208,1.6990,0.035339,24.0,5.0,0.0,0.0,1.0,13.0,0,0.0,0,0.0
1,This is a movie that only those who have felt ...,43,0.0270,1.6335,0.044104,15.0,4.0,1.0,2.0,3.0,4.0,0,1.0,0,1.0
2,Truly a masterpiece The Best Hollywood film o...,86,0.0299,1.6335,0.048842,29.0,6.0,1.0,3.0,4.0,14.0,0,1.0,0,1.0
3,Joaquin Phoenix gives a tour de force performa...,54,0.0222,1.7324,0.038459,24.0,2.0,0.0,2.0,3.0,10.0,0,1.0,0,0.0
4,Most of the time movies are anticipated like t...,52,0.0213,1.7160,0.036551,18.0,4.0,0.0,1.0,2.0,11.0,0,1.0,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120,My husband and I went to see this film and fou...,253,0.0051,2.4031,0.012256,94.0,24.0,0.0,4.0,2.0,51.0,0,1.0,0,1.0
121,JOKER is a gift to the audiences I felt as a ...,27,0.0385,1.4314,0.055109,15.0,2.0,0.0,0.0,0.0,2.0,0,0.0,0,0.0
122,Dark depressing and unsettling film with a ha...,17,0.0588,1.2304,0.072348,7.0,0.0,0.0,0.0,0.0,5.0,0,0.0,0,0.0
123,,1,1.0000,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0,0.0
