# Module 1: Working with Text in Python

## Working with text
<br>

**String Operations**

`s.lower()`; `s.upper()`; `s.titlecase()`

`s.split(t)`

`s.splitlines()`

`s.join(t)`

`s.strip()`; `s.rstrip()`

`s.find(t)`; `s.rfind(t)`

`s.replace(u, v)`

`s.startswith(t)` 

`s.endswith(t)` 

`t in s` 

`s.isupper()`; `s.islower()`; `s.istitle()` 

`s.isalpha()`; `s.isdigit()`; `s.isalnum()`

**File Operations**

`f = open(filename, mode)`

`f.readline()`; `f.read()`; `f.read(n)`

`for line in f: doSomething(line)`

`f.seek(n)`

`f.write(message)`

`f.close()`

`f.close`

In [97]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1)

76

In [98]:
text2 = text1.split(' ')
len(text2)

14

In [99]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']

In [100]:
[w for w in text2 if len(w) > 3]

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [101]:
[w for w in text2 if w.istitle()]

['Ethics', 'United', 'Nations']

In [102]:
[w for w in text2 if w.endswith('s')]

['Ethics', 'ideals', 'objectives', 'Nations']

In [103]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [104]:
len(set(text4))

5

In [105]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [106]:
len(set([w.lower() for w in text4]))

4

In [107]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}

In [1]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'

In [108]:
text6 = text5.split(' ')
text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

In [5]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

In [6]:
[w for w in text6 if w.startswith('@')]

['@']

In [7]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

In [8]:
text8

['@UN',
 '@UN_Women',
 '"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

In [9]:
[w for w in text8 if w.startswith('#')]

['#UNSG']

In [10]:
[w for w in text8 if w.startswith('@')]

['@UN', '@UN_Women', '@']

## Regular Expressions
<br>

We can use regular expressions to help us with more complex parsing. 

For example `'@[A-Za-z0-9_]+'` will return all words that: 
* start with `'@'` and are followed by at least one: 
* capital letter (`'A-Z'`)
* lowercase letter (`'a-z'`) 
* number (`'0-9'`)
* or underscore (`'_'`)

In [13]:
import re

In [12]:
[w for w in text8 if re.search('@[A-Za-z0-9_]+',w)]

['@UN', '@UN_Women']

### Meta-characters: Character matches 
`.` : wildcard, matches a single character.

`^` : start of a string

`$` : end of a string

`[]` : matches one of the set of characters within []

`[a-z]` : matches one of the range of characters a, b, …, z

`[^abc]` : matches a character that is not a, b, or, c

`a|b` : matches either a or b, where a and b are strings

`()` : Scoping for operators

`\` : Escape character for special characters (\t, \n, \b)

### Meta-characters: Character symbols 
`\b` : Matches word boundary

`\d` : Any digit, equivalent to [0-9]

`\D` : Any non-digit, equivalent to [^0-9]

`\s` : Any whitespace, equivalent to [ \t\n\r\f\v]

`\S` : Any non-whitespace, equivalent to [^ \t\n\r\f\v]

`\w` : Alphanumeric character, equivalent to [a-zA-Z0-9_]

`\W` : Non-alphanumeric, equivalent to [^a-zA-Z0-9_]


### Meta-characters: Repetitions 
`*` : matches zero or more occurrences 

`+` : matches one or more occurrences 

`?` : matches zero or one occurrences 

`{n}` : exactly n repetitions, n≥ 0 

`{n,}` : at least n repetitions 

`{,n}` : at most n repetitions 

`{m,n}` : at least m and at most n repetitions

In [14]:
[w for w in text8 if re.search('@\w+',w)]

['@UN', '@UN_Women']

In [20]:
text9 = 'ouagadougou'

In [21]:
re.findall(r'[aeiou]',text9)

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']

In [22]:
re.findall(r'[^aeiou]',text9)

['g', 'd', 'g']

### Regular expression for Dates
`\d{2}[/-]\d{2}[/-]\d{4}` **Means**

Two digits << `/`or`-` << Two digits << `/`or`-` << Four digits

`\d{1,2}` **Means** one or two digits

In [25]:
dateStr = "23-10-2002\n23/10/2002\n23/10/02\n10/23/2002\n23 Oct 2002\n23 October 2002\nOct 23, 2002\nOctober 23, 2002\n"

In [26]:
re.findall('\d{2}[/-]\d{2}[/-]\d{4}',dateStr)

['23-10-2002', '23/10/2002', '10/23/2002']

In [27]:
re.findall('\d{2}[/-]\d{2}[/-]\d{2,4}',dateStr)

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']

In [28]:
re.findall('\d{1,2}[/-]\d{1,2}[/-]\d{2,4}',dateStr)

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']

In [29]:
re.findall(r'\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', dateStr)

['Oct']

Bracket indicates scoping meaning that it will only pulls the match inside the brackets.

In [30]:
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', dateStr)

['23 Oct 2002']

Using the question mark colon `?:` special character sequence indicates that not only pull the inside bracket match but the hole string.

In [32]:
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}', dateStr)

['23 Oct 2002', '23 October 2002']

In [33]:
re.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}', dateStr)

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']

## Working with Text Data in pandas

In [34]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

Unnamed: 0,text
0,Monday: The doctor's appointment is at 2:45pm.
1,Tuesday: The dentist's appointment is at 11:30...
2,"Wednesday: At 7:00pm, there is a basketball game!"
3,Thursday: Be back home by 11:15 pm at the latest.
4,"Friday: Take the train at 08:10 am, arrive at ..."


In [46]:
# find the number of characters for each string in df['text']
df['text'].str.len()

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

In [48]:
# find the number of tokens for each string in df['text']
df['text'].str.split().str.len()

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

In [51]:
# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

In [54]:
df[df['text'].str.contains('appointment') == True].count()

text    2
dtype: int64

In [55]:
# find how many times a digit occurs in each string
df['text'].str.count(r'\d')

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

In [56]:
# find all occurances of the digits
df['text'].str.findall(r'\d')

0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

In [59]:
# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

In [62]:
df['text'].str.findall(r'\d?\d:\d\d')

0            [2:45]
1           [11:30]
2            [7:00]
3           [11:15]
4    [08:10, 09:00]
Name: text, dtype: object

In [79]:
# replace weekdays with '???'
df['text'].str.replace(r'\w+day\b','???')

  df['text'].str.replace(r'\w+day\b','???')


0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [83]:
# replace weekdays with 3 letter abbrevations
df['text'].str.replace(r'(\w+day\b)',lambda x: x.groups()[0][:3])

  df['text'].str.replace(r'(\w+day\b)',lambda x: x.groups()[0][:3])


0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [87]:
df['text'].str.findall(r'(\w+day\b)')

0       [Monday]
1      [Tuesday]
2    [Wednesday]
3     [Thursday]
4       [Friday]
Name: text, dtype: object

In [88]:
# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')

Unnamed: 0,0,1
0,2,45
1,11,30
2,7,0
3,11,15
4,8,10


In [90]:
# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1,2,3
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,0,2:45pm,2,45,pm
1,0,11:30 am,11,30,am
2,0,7:00pm,7,0,pm
3,0,11:15 pm,11,15,pm
4,0,08:10 am,8,10,am
4,1,09:00am,9,0,am


In [93]:
df['text'].str.findall(r'(\d?\d):(\d\d) ?([ap]m)')

0                   [(2, 45, pm)]
1                  [(11, 30, am)]
2                   [(7, 00, pm)]
3                  [(11, 15, pm)]
4    [(08, 10, am), (09, 00, am)]
Name: text, dtype: object

In [94]:
df['text'].str.findall(r'((\d?\d):(\d\d) ?([ap]m))')

0                              [(2:45pm, 2, 45, pm)]
1                           [(11:30 am, 11, 30, am)]
2                              [(7:00pm, 7, 00, pm)]
3                           [(11:15 pm, 11, 15, pm)]
4    [(08:10 am, 08, 10, am), (09:00am, 09, 00, am)]
Name: text, dtype: object

In [96]:
# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<Time>(?P<Hour>\d?\d):(?P<Minute>\d\d) ?(?P<Period>[ap]m))')

Unnamed: 0_level_0,Unnamed: 1_level_0,Time,Hour,Minute,Period
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,0,2:45pm,2,45,pm
1,0,11:30 am,11,30,am
2,0,7:00pm,7,0,pm
3,0,11:15 pm,11,15,pm
4,0,08:10 am,8,10,am
4,1,09:00am,9,0,am


# Module 2: Basic Natural Language Processing

In [1]:
import nltk

In [9]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [10]:
from nltk.book import *

In [13]:
texts()

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [14]:
text1

<Text: Moby Dick by Herman Melville 1851>

In [15]:
sents()

sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .


In [16]:
sent1

['Call', 'me', 'Ishmael', '.']

In [17]:
text7

<Text: Wall Street Journal>

In [18]:
sent7

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [19]:
len(sent7)

18

In [20]:
len(text7)

100676

In [21]:
len(set(text7))

12408

In [22]:
list(set(text7))[:10]

['Perhaps',
 'spot',
 'noncompetitively',
 'Driskill',
 'Auditors',
 '100,980',
 'enact',
 'loans',
 'Phipps',
 'trimming']

In [23]:
dist = FreqDist(text7)

In [24]:
len(dist)

12408

In [25]:
vocab = dist.keys()

In [28]:
list(vocab)[:10]

['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']

In [29]:
dist['join']

4

In [35]:
freqwords = [w for w in vocab if len(w)>5 and dist[w]>100]
freqwords

['billion',
 'company',
 'president',
 'because',
 'market',
 'million',
 'shares',
 'trading',
 'program']

## Normalization and Stemming

Different forms of the same “word”

In [36]:
input1 = "List listed lists listing listings"

In [38]:
words1 = input1.lower().split(' ')
words1

['list', 'listed', 'lists', 'listing', 'listings']

In [41]:
porter = nltk.PorterStemmer()

In [43]:
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']

## Lemmatization

Stemming, but resulting stems are all valid words

In [44]:
udhr = nltk.corpus.udhr.words('English-Latin1')

In [45]:
udhr[:20]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'rights',
 'of']

In [46]:
[porter.stem(t) for t in udhr]

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of',
 'all',
 'member',
 'of',
 'the',
 'human',
 'famili',
 'is',
 'the',
 'foundat',
 'of',
 'freedom',
 ',',
 'justic',
 'and',
 'peac',
 'in',
 'the',
 'world',
 ',',
 'wherea',
 'disregard',
 'and',
 'contempt',
 'for',
 'human',
 'right',
 'have',
 'result',
 'in',
 'barbar',
 'act',
 'which',
 'have',
 'outrag',
 'the',
 'conscienc',
 'of',
 'mankind',
 ',',
 'and',
 'the',
 'advent',
 'of',
 'a',
 'world',
 'in',
 'which',
 'human',
 'be',
 'shall',
 'enjoy',
 'freedom',
 'of',
 'speech',
 'and',
 'belief',
 'and',
 'freedom',
 'from',
 'fear',
 'and',
 'want',
 'ha',
 'been',
 'proclaim',
 'as',
 'the',
 'highest',
 'aspir',
 'of',
 'the',
 'common',
 'peopl',
 ',',
 'wherea',
 'it',
 'is',
 'essenti',
 ',',
 'if',
 'man',
 'is',
 'not',
 'to',
 'be',
 'compel',
 'to',
 'have',
 'recours',
 '

In [47]:
wnlemma = nltk.WordNetLemmatizer()

In [48]:
[wnlemma.lemmatize(t) for t in udhr]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of',
 'all',
 'member',
 'of',
 'the',
 'human',
 'family',
 'is',
 'the',
 'foundation',
 'of',
 'freedom',
 ',',
 'justice',
 'and',
 'peace',
 'in',
 'the',
 'world',
 ',',
 'Whereas',
 'disregard',
 'and',
 'contempt',
 'for',
 'human',
 'right',
 'have',
 'resulted',
 'in',
 'barbarous',
 'act',
 'which',
 'have',
 'outraged',
 'the',
 'conscience',
 'of',
 'mankind',
 ',',
 'and',
 'the',
 'advent',
 'of',
 'a',
 'world',
 'in',
 'which',
 'human',
 'being',
 'shall',
 'enjoy',
 'freedom',
 'of',
 'speech',
 'and',
 'belief',
 'and',
 'freedom',
 'from',
 'fear',
 'and',
 'want',
 'ha',
 'been',
 'proclaimed',
 'a',
 'the',
 'highest',
 'aspiration',
 'of',
 'the',
 'common',
 'people',
 ',',
 'Whereas',
 'it',
 'is',
 'essential',
 ',',
 'if',
 'man',
 'is',
 'not',
 'to',
 

## Tokenization

- Recall splitting a sentence into words / tokens
- NLTK has an in-built tokenizer

In [49]:
text11 = "Children shouldn't drink a sugary drink before bed."

In [50]:
text11.split(' ')

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

In [51]:
nltk.word_tokenize(text11)

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']

## Sentence Splitting

NLTK has an in-built sentence splitter too!

In [53]:
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"

In [54]:
sentences = nltk.sent_tokenize(text12)

In [55]:
len(sentences)

4

In [56]:
sentences

['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']

## Part-of-speech (POS) Tagging
![pos.png](attachment:pos.png)

In [57]:
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [58]:
nltk.help.upenn_tagset('CC')

CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet


In [60]:
text11 = "Children shouldn't drink a sugary drink"

In [62]:
text12 = nltk.word_tokenize(text11)
text12

['Children', 'should', "n't", 'drink', 'a', 'sugary', 'drink']

In [63]:
nltk.pos_tag(text12)

[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN')]

# Module 3:Text Classification

## Case Study - Sentiment Analysis

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv('lab-files/Amazon_Unlocked_Mobile.csv')
df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [4]:
df.dropna(inplace=True)
df =df[df['Rating'] != 3]
df['Positively Rated'] = np.where(df['Rating']>3,1,0)

In [5]:
df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,1
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,1
5,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,1,I already had a phone with problems... I know ...,1.0,0
6,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,The charging port was loose. I got that solder...,0.0,0
7,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,"Phone looks good but wouldn't stay charged, ha...",0.0,0
8,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I originally was using the Samsung S2 Galaxy f...,0.0,1
11,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,This is a great product it came after two days...,0.0,1


In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['Reviews'],df['Positively Rated'], random_state=0)

In [7]:
print('X_train first entry:\n\n',X_train[0])
print('\n\nX_train shape',X_train.shape)

X_train first entry:

 I feel so LUCKY to have found this used (phone to us & not used hard at all), phone on line from someone who upgraded and sold this one. My Son liked his old one that finally fell apart after 2.5+ years and didn't want an upgrade!! Thank you Seller, we really appreciate it & your honesty re: said used phone.I recommend this seller very highly & would but from them again!!


X_train shape (231207,)


In [8]:
print('y_train haed:\n\n',y_train.head())
print('\n\ny_train shape',y_train.shape)

y_train haed:

 97039     0
243783    1
88792     0
388802    1
161607    1
Name: Positively Rated, dtype: int32


y_train shape (231207,)


<br>

`CountVectorizer()`

Looking at X_train,
we can see we have a series of over 231,000 reviews or documents.
We'll need to convert these into a numeric representation that scikit-learn can use.
The bag-of-words approach is simple and commonly used way to represent text for
use in machine learning, which ignores structure and
only counts how often each word occurs.
CountVectorizer allows us to use the bag-of-words approach
by converting a collection of text documents into a matrix of token counts. 

`.fit`

Fitting the CountVectorizer tokenizes each document by finding all sequences of
characters of at least two letters or numbers separated by word boundaries.
Converts everything to lowercase and builds a vocabulary using these tokens. 

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer().fit(X_train)

In [10]:
vect

In [11]:
feature_names = vect.get_feature_names_out()
feature_names[::2000]

array(['00', '4less', 'adr6275', 'assignment', 'blazingly', 'cassettes',
       'condishion', 'debi', 'dollarsshipping', 'esteem', 'flashy',
       'gorila', 'human', 'irullu', 'like', 'microsaudered',
       'nightmarish', 'p770', 'poori', 'quirky', 'responseive', 'send',
       'sos', 'synch', 'trace', 'utiles', 'withstanding'], dtype=object)

In [29]:
len(feature_names)

53216

<br>

`vect.transform(X_train)`

Transform the documents in X_train to a document term matrix, giving us the bag-of-word representation of X_train.

This representation is stored in a SciPy sparse matrix, where each row corresponds to a document and each column a word from our training vocabulary. 

The entries in this matrix are the number of times each word appears in each document. **<231207x53216>**

In [12]:
X_train_vectorized = vect.transform(X_train)
X_train_vectorized

<231207x53216 sparse matrix of type '<class 'numpy.int64'>'
	with 6117776 stored elements in Compressed Sparse Row format>

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

model = LogisticRegression()
model.fit(X_train_vectorized,y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ',roc_auc_score(predictions,y_test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


AUC:  0.9353104406316923


In [14]:
sorted_coef_index = model.coef_[0].argsort()
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['worst' 'garbage' 'junk' 'unusable' 'false' 'worthless' 'useless'
 'crashing' 'disappointing' 'awful']

Largest Coefs:
['excelent' 'excelente' 'exelente' 'loving' 'loves' 'perfecto' 'excellent'
 'complaints' 'awesome' 'buen']



In [47]:
model.coef_[0]

array([-0.25880291,  0.19583603,  0.02249367, ...,  0.00115398,
        0.153294  ,  0.01216856])

In [41]:
# sorted_coef_index

array([52310, 21272, 26705, ..., 18547, 18377, 18376], dtype=int64)

In [35]:
# feature_names[sorted_coef_index[:10]]

array(['worst', 'garbage', 'junk', 'unusable', 'false', 'worthless',
       'useless', 'crashing', 'disappointing', 'awful'], dtype=object)

In [43]:
# feature_names[sorted_coef_index[:-11:-1]]

array(['excelent', 'excelente', 'exelente', 'loving', 'loves', 'perfecto',
       'excellent', 'complaints', 'awesome', 'buen'], dtype=object)

### Term Frequency-Inverse Document Frequency (Tfidf)

<br>

- Features with low tf–idf are either commonly used across all documents or rarely used and only occur in long documents.
- Features with high tf–idf are frequently used within specific documents, but rarely used across all documents.

<br>
CountVectorizor and tf–idf Vectorizor both take an argument,
mindf, which allows us to specify a minimum number of documents
in which a token needs to appear to become part of the vocabulary. 

<br>

This helps us remove some words that might appear in only a few and
are unlikely to be useful predictors.
For example, here we'll pass in min_df = 5, which will remove any words
from our vocabulary that appear in fewer than five documents. 

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(min_df=5).fit(X_train)
feature_names_tf = tfidf_vect.get_feature_names_out()
len(feature_names_tf)

17951

**Num of features 've been reduced from 53216 to 17951**

In [16]:
X_train_vectorized_tf = tfidf_vect.transform(X_train)

# tf_model = LogisticRegression()
# model.fit(X_train_vectorized_tf,y_train)

# tf_predictions = model.predict(tfidf_vect.transform(X_test))

# print('AUC: ',roc_auc_score(y_test,tf_predictions))

In [17]:
sorted_tfidf_index = X_train_vectorized_tf.max(0).toarray()[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names_tf[sorted_tfidf_index[0:10]]))
print('Largest Coefs:\n{}\n'.format(feature_names_tf[sorted_tfidf_index[:-11:-1]]))

Smallest Coefs:
['commenter' 'pthalo' 'warmness' 'storageso' 'aggregration' '1300'
 '625nits' 'a10' 'submarket' 'brawns']

Largest Coefs:
['defective' 'batteries' 'gooood' 'epic' 'luis' 'goood' 'basico'
 'aceptable' 'problems' 'excellant']



In [71]:
# X_train_vectorized_tf

<231207x17951 sparse matrix of type '<class 'numpy.float64'>'
	with 6056695 stored elements in Compressed Sparse Row format>

In [73]:
# X_train_vectorized_tf.max(0)

<1x17951 sparse matrix of type '<class 'numpy.float64'>'
	with 17951 stored elements in COOrdinate format>

In [74]:
# X_train_vectorized_tf.max(0).toarray()

array([[0.71042189, 0.32454897, 0.31976905, ..., 0.4614497 , 0.49678755,
        0.49678755]])

In [76]:
# X_train_vectorized_tf.max(0).toarray()[0]

array([0.71042189, 0.32454897, 0.31976905, ..., 0.4614497 , 0.49678755,
       0.49678755])

In [77]:
# X_train_vectorized_tf.max(0).toarray()[0].argsort()

array([ 3624, 12532, 17320, ...,  7414,  2184,  4635], dtype=int64)

In [78]:
# feature_names_tf[X_train_vectorized_tf.max(0).toarray()[0].argsort()[0:10]]

array(['commenter', 'pthalo', 'warmness', 'storageso', 'aggregration',
       '1300', '625nits', 'a10', 'submarket', 'brawns'], dtype=object)

In [80]:
# feature_names_tf[X_train_vectorized_tf.max(0).toarray()[0].argsort()[:-11:-1]]

array(['defective', 'batteries', 'gooood', 'epic', 'luis', 'goood',
       'basico', 'aceptable', 'problems', 'excellant'], dtype=object)