### Tokenization part II ###

** working with text **

In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

The CSV file format is not standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line-breaks. CSV implementations may not handle such field data, or they may use quotation marks to surround the field. Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences.

In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. These alternate delimiter-separated files are often even given a .csv extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character.

In [1]:
text = """In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

The CSV file format is not standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line-breaks. CSV implementations may not handle such field data, or they may use quotation marks to surround the field. Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences.

In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. These alternate delimiter-separated files are often even given a .csv extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character.

"""

How can we extract the paragraphs, and then the sentences, from this text?

In [2]:
paragraphs = text.split("\n")

print(paragraphs[0])

In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.


In [4]:
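# Empty: the paragraphs are separated by a blank line ("\n\n"),
# so splitting on a single "\n" leaves an empty string between them.
print(paragraphs[1])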




In [14]:
paragraphs = text.split("\n\n")  # fix: split on the blank line ("\n\n") between paragraphs

print(paragraphs[1])

The CSV file format is not standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line-breaks. CSV implementations may not handle such field data, or they may use quotation marks to surround the field. Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences.
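The difference between the two splits is easier to see on a tiny string. The sketch below uses a made-up two-paragraph sample (the variable names are illustrative, not from the cells above) with the same layout as text: paragraphs separated by a blank line.

```python
# A made-up sample with the same layout as `text`:
# paragraphs separated by a blank line, plus a trailing newline.
sample = "First paragraph.\n\nSecond paragraph.\n"

by_newline = sample.split("\n")       # a single "\n" as separator
by_blank_line = sample.split("\n\n")  # a blank line as separator

print(by_newline)     # ['First paragraph.', '', 'Second paragraph.', '']
print(by_blank_line)  # ['First paragraph.', 'Second paragraph.\n']
```

Splitting on "\n\n" removes the empty entries between paragraphs, but the last piece still carries its trailing newline.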


** a better solution **

In [6]:
import re

sp = re.split(r'\n+', text)  # split on every run of one or more newlines
print(sp)

['In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.', 'The CSV file format is not standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line-breaks. CSV implementations may not handle such field data, or they may use quotation marks to surround the field. Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences.', 'In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters. These include tab-separated values and space-separated values. A delimiter that is not present 
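Because text ends with newlines, splitting on '\n+' also returns an empty string as the last element. A slightly more defensive variant, shown here only as a sketch, strips the text first, splits on blank lines (which may contain stray spaces) and drops empty pieces. The function name split_paragraphs and the sample string are made up for illustration. Unlike '\n+', this pattern keeps single line breaks inside a paragraph.

```python
import re

def split_paragraphs(raw):
    """Split on blank lines (possibly containing spaces) and drop empty pieces."""
    pieces = re.split(r"\n\s*\n", raw.strip())
    return [p.strip() for p in pieces if p.strip()]

# Made-up sample: the second separator contains a stray space.
sample = "First paragraph.\n\nSecond paragraph.\n \n\nThird paragraph.\n\n"
result = split_paragraphs(sample)
print(result)  # ['First paragraph.', 'Second paragraph.', 'Third paragraph.']
```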

In [8]:
help(re)

Help on module re:

NAME
    re - Support for regular expressions (RE).

MODULE REFERENCE
    https://docs.python.org/3.5/library/re
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last mat

** splitting sentences the simple way **

In [18]:
for s in sp:
    sentences = s.split(".")
    print(sentences)
    break

['In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text', ' Each line of the file is a data record', ' Each record consists of one or more fields, separated by commas', ' The use of the comma as a field separator is the source of the name for this file format', '']
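The split on "." drops the final period of each sentence, leaves a leading space on every piece and produces an empty string at the end. A small regular-expression variant avoids these three problems. This is only a sketch (naive_sentences is an illustrative name), and it still breaks on abbreviations and acronyms.

```python
import re

def naive_sentences(paragraph):
    """Split after '.', '!' or '?' when followed by whitespace; keep the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

demo = ("Each line of the file is a data record. "
        "Each record consists of one or more fields, separated by commas.")
sentences_demo = naive_sentences(demo)
for s in sentences_demo:
    print(s)
```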


## tokenization the professional way ##

** using NLTK, the Natural Language Toolkit **

In [13]:
import nltk 
from nltk.tokenize import sent_tokenize

In [None]:
# NLTK packages to fetch with nltk.download(): "punkt", "stopwords", "twitter_samples"

In [16]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/mick/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

** to use the sentence tokenizer we need the pre-trained English model called "punkt", downloaded with nltk.download('punkt') (recent NLTK releases may ask for 'punkt_tab' instead) **

In [22]:
sentences = sent_tokenize(text)

In [24]:
for s in sentences:
    print(s)

In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text.
Each line of the file is a data record.
Each record consists of one or more fields, separated by commas.
The use of the comma as a field separator is the source of the name for this file format.
The CSV file format is not standardized.
The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line-breaks.
CSV implementations may not handle such field data, or they may use quotation marks to surround the field.
Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences.
In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters.
These include tab-separated values and space-separated values.
A delimiter that is not present in the f

In [28]:
complex_case = "This is a sentence. In this one there is an acronym N.A.S.A to treat like a sentence."
sentences = sent_tokenize(complex_case)

In [29]:
for c in sentences:
    print(c)

This is a sentence.
In this one there is an acronym N.A.S.A to treat like a sentence.
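For contrast, here is what the simple split(".") from earlier does with the same string. The sketch only uses the standard library, so it runs without NLTK.

```python
complex_case = "This is a sentence. In this one there is an acronym N.A.S.A to treat like a sentence."

# Every period is treated as a sentence boundary, including those inside the acronym.
pieces = complex_case.split(".")

print(len(pieces))  # 6
print(pieces)
```

The acronym is cut into separate pieces ('A', 'S', ...), whereas sent_tokenize returned the two expected sentences.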


** tokenizing words **

In [30]:
from nltk.tokenize import word_tokenize

In [31]:
tokens = word_tokenize(text)

In [33]:
print(tokens)

['In', 'computing', ',', 'a', 'comma-separated', 'values', '(', 'CSV', ')', 'file', 'stores', 'tabular', 'data', '(', 'numbers', 'and', 'text', ')', 'in', 'plain', 'text', '.', 'Each', 'line', 'of', 'the', 'file', 'is', 'a', 'data', 'record', '.', 'Each', 'record', 'consists', 'of', 'one', 'or', 'more', 'fields', ',', 'separated', 'by', 'commas', '.', 'The', 'use', 'of', 'the', 'comma', 'as', 'a', 'field', 'separator', 'is', 'the', 'source', 'of', 'the', 'name', 'for', 'this', 'file', 'format', '.', 'The', 'CSV', 'file', 'format', 'is', 'not', 'standardized', '.', 'The', 'basic', 'idea', 'of', 'separating', 'fields', 'with', 'a', 'comma', 'is', 'clear', ',', 'but', 'that', 'idea', 'gets', 'complicated', 'when', 'the', 'field', 'data', 'may', 'also', 'contain', 'commas', 'or', 'even', 'embedded', 'line-breaks', '.', 'CSV', 'implementations', 'may', 'not', 'handle', 'such', 'field', 'data', ',', 'or', 'they', 'may', 'use', 'quotation', 'marks', 'to', 'surround', 'the', 'field', '.', 'Quo
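To get a feeling for what a word tokenizer has to do, the output above can be roughly imitated with one regular expression: a token is either a word (optionally hyphenated) or a single punctuation character. This is a sketch, not what word_tokenize does internally, and the names TOKEN_RE and simple_word_tokenize are made up for illustration. It will differ from NLTK on contractions, for example.

```python
import re

# A word (optionally hyphenated, e.g. "comma-separated"),
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_word_tokenize(s):
    """Return the list of regex matches, in order of appearance."""
    return TOKEN_RE.findall(s)

demo_tokens = simple_word_tokenize("In computing, a comma-separated values (CSV) file")
print(demo_tokens)
# ['In', 'computing', ',', 'a', 'comma-separated', 'values', '(', 'CSV', ')', 'file']
```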

** a specialized tokenizer for tweets **

In [9]:
from nltk.tokenize import TweetTokenizer

s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"

model = TweetTokenizer()

tokens = model.tokenize(s0)

print(tokens)

['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']


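Notice how the tokenizer keeps the hashtag, the emoticons and the arrows ('->', '<--') as single tokens instead of splitting them into individual punctuation characters.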
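The general idea behind this kind of tokenizer can be sketched with a regular expression that tries the special patterns (emoticons, hashtags, mentions) before falling back to plain words and single symbols. The pattern below is a toy written for this notebook, not the pattern used by TweetTokenizer. It only knows a handful of emoticons and, unlike TweetTokenizer, would split arrows such as '->'.

```python
import re

# Alternatives are tried left to right, so the special cases come first.
TWEET_RE = re.compile(
    r"""
    [:;=][-']?[)(DPp]   # a few western emoticons
    | <3                # heart
    | [#@]\w+           # hashtags and @mentions
    | \w+               # plain words
    | [^\w\s]           # any other single symbol
    """,
    re.VERBOSE,
)

def toy_tweet_tokenize(s):
    """Return all matches of TWEET_RE in order."""
    return TWEET_RE.findall(s)

toy_tokens = toy_tweet_tokenize("so cooool #dummysmiley: :-) :-P <3 @user")
print(toy_tokens)
# ['so', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', '@user']
```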

In [43]:
import json
from nltk.tokenize import TweetTokenizer

In [44]:
file_location = "/home/mick/nltk_data/corpora/twitter_samples/negative_tweets.json"

In [45]:
model = TweetTokenizer()
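The cell that builds allTokens is not shown here. A minimal sketch of how it could look, assuming that every line of the file is one tweet encoded as a JSON object and that the tweet text is stored under the 'text' key (both are assumptions, as is the helper name tokenize_tweets). The tokenizer is passed in as a function, so the demo below runs on stand-in data without NLTK or the corpus file.

```python
import json

def tokenize_tweets(lines, tokenize):
    """Parse one JSON tweet per line and tokenize its 'text' field (assumed key)."""
    all_tokens = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        all_tokens.append(tokenize(tweet["text"]))
    return all_tokens

# Stand-in data and tokenizer, only for demonstration.
demo_lines = ['{"text": "so sad :("}', '{"text": "need coffee"}']
demo_all_tokens = tokenize_tweets(demo_lines, str.split)
print(demo_all_tokens)  # [['so', 'sad', ':('], ['need', 'coffee']]
```

In the notebook this would be called on the open file with the tweet tokenizer, for example allTokens = tokenize_tweets(f, model.tokenize) inside a with open(file_location) as f: block.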

In [56]:
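countWords = {}        # token -> number of occurrences
howmanywords = set()   # distinct tokens seen so far
j = 0                  # number of tokens processed so far
# allTokens is not defined in the cells shown here;
# it is assumed to hold one list of tokens per tweet
for t in allTokens:
    for word in t:
        howmanywords.add(word)
        # report the vocabulary size every 10,000 tokens
        if j % 10000 == 0:
            print(len(howmanywords))
        if word in countWords:
            countWords[word] += 1
        else:
            countWords[word] = 1
        j += 1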


1
3251
5372
7069
8725
10467
12013
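The same counts can be obtained more compactly with collections.Counter from the standard library. The sketch below runs on a made-up list of token lists (demo and count_tokens are illustrative names); in the notebook the input would be allTokens.

```python
from collections import Counter

def count_tokens(all_tokens):
    """Count every token over a list of token lists."""
    return Counter(word for tokens in all_tokens for word in tokens)

demo = [["so", "sad", ":("], ["sad", ":(", ":("]]
counts = count_tokens(demo)

print(len(counts))            # 3 distinct tokens
print(counts.most_common(2))  # [(':(', 3), ('sad', 2)]
```

len(counts) plays the role of len(howmanywords), and most_common(n) gives the n most frequent tokens without any manual sorting.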


In [60]:
import pandas as pd

In [63]:
df = pd.DataFrame.from_dict(countWords,orient="index")

In [68]:
df.sort_values(by=0,ascending = False).head(100)

        0
:(   4585
I    1587
(    1180
.    1092
to   1068
the   846
!     831
,     734
you   660
?     644
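The top of this ranking is dominated by punctuation, and the counts are case-sensitive, so differently capitalised forms of the same word are counted separately. One possible clean-up, shown as a sketch only, lowercases every token and skips tokens that contain no letter or digit. The function name and the demo data are made up, and the filter is a design choice: it also removes emoticons such as ':(', which may be exactly the signal of interest in a set of negative tweets.

```python
from collections import Counter

def normalized_counts(all_tokens):
    """Count lowercased tokens, skipping tokens without any letter or digit."""
    counts = Counter()
    for tokens in all_tokens:
        for tok in tokens:
            if not any(ch.isalnum() for ch in tok):
                continue  # skip tokens like '.', '!' or ':('
            counts[tok.lower()] += 1
    return counts

demo = [["I", "need", "sleep", ":("], ["Need", "coffee", "!"]]
clean = normalized_counts(demo)
print(clean.most_common(1))  # [('need', 2)]
```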


In [59]:
for word,counts in countWords.items():
    print(word,counts)

💓 2
@ApplePieQueen_ 1
http://t.co/taVMCz37E7 1
visiting 1
Please 21
@xBethanyOystonx 1
Until 1
http://t.co/r5fMnyWyUf 1
@LucasUpton 1
ervin 1
Caroline 1
Ashraf 1
#MUFC 1
#nudes 3
waaah 1
#PitchWars 1
filming 2
http://t.co/UR8ZwngzJZ 1
#hornykik 17
Nowt 1
#Kadhafi 2
Utd 2
thumb 1
spooky 1
32 1
@1q4h_ 1
Ce 1
budget 1
UNEXPECT 1
oops 2
argument 1
pero 1
QUIT 1
PHANTASY 1
crazy 7
pas 1
CANNOT 1
trash 2
@teamkins 1
scarf 1
Waking 1
http://t.co/iL86HQ4Uyh 1
Subdivision 1
havent 9
hairs 1
@Janettaras 1
townssssss 1
Toll 1
listening 4
@Marx_Envy 2
@tanginarrymo 1
Emma 1
@bbcweather 1
Nor 1
Sry 1
Nerve 1
nanaman 1
@rvirenee_ 1
bestfriends 1
getaway 1
NEMEN 1
choregrapher 1
viber 1
BIOMES 1
🍹 3
need 81
babyy 1
@haestarxx 1
@HUNCH0 1
@LHBF7F 1
@thetrin 1
agessss 1
samjha 1
Lost 1
charts 2
633 1
BROOKE 1
@josselynramos01 1
saturday 4
yet 32
dormmates 1
critical 1
Dear 1
lasting 1
honestly 3
Dard 1
23 2
babies 4
@GABRlEIIE 1
photo 6
dissappeared 1
Oh 33
something 26
nation 2
ised 1
Royal 1
dongwoo 

AE 1
1k 2
sentir-se 1
prescription 1
flying 3
@shakyra_cledera 1
politics 1
tears 1
MEAL 1
WTF 2
Ive 3
To 4
desc 1
agains 1
walked 1
barbells 1
#BOMBING 1
WORK 1
June 3
Hey 12
321 1
@seungwannabe 1
@itisfurny 1
@tyrafendy 1
teal 1
@tallertara 1
surely 2
25 3
sight 1
ging 1
Weird 1
fank 1
#wakeupGOP 1
Srsly 1
SHE'D 1
eggs 1
gift 10
claim 2
tumblr 1
dissapointed 1
legends 1
》 210
@soorjugn 1
Full 1
@ohmydayz 1
@Meem_Hye 1
@SANDEUL203092 1
asap 1
@MissKittyDomme 1
@lucyreesxo 1
bummed 1
way 42
pie 1
Same 3
@sasaribena 1
poc 1
medication 1
#Frustration 1
errors 2
Stress 2
vote 4
aisyah 1
TOMMYY 1
Their 4
@marmaisxcz 1
game 18
ugh 15
disgusts 1
Definitely 1
https://t.co/JxjroN8PcY 1
cartoon 1
@AdoreDelano 1
cycle 2
@ShellyBarker123 1
Hoping 2
@OpTic_Mochila 1
@Errata_0 1
biggest 3
entertain 2
might 9
Nope 2
randomly 1
fbc 3
Louis 1
Tobermory 1
bean 1
Hong 1
@atiraxia 1
@eydiespi 1
f 1
broh 1
grow 2
basket 1
#39 1
#video 3
https://t.co/k7bFXN9H5V 1
@sandwichlove_ 1
@rcdlccom 1
rosie 1
recove

@durnurr 1
whay 1
Pakighinabi 1
3am 2
FORGETTING 1
Club 1
fucking 15
http://t.co/75IDDesHD0 1
@interpedia_ 1
smoking 1
foreals 1
@pastelwolfxx 1
@2baconil 1
@wbuharryy 1
pumpkin 1
@farhani_nfr 1
@SwitInno 1
https://t.co/IpxqMQWFeM 1
W 1
climbing 1
@KimDyvotees 1
certainly 2
zz 1
apartment 1
NODE 1
done 24
midnight 2
@izzsugden 1
18th 2
Polaroid 1
child 2
cheeky 1
@CarltonFC 1
classes 2
http://t.co/iVcyNmm8vP 1
veeeeerry 1
@itmeailung 1
Depends 1
secondary 1
tries 1
jot 1
tay 1
sandwich 3
Sunway 1
meeee 1
tdy 1
fire 1
@MissStaceyVee 1
CNN 1
pension 1
bees 1
@swiftstruelove 1
blog 1
gorilla 1
http://t.co/qtgCn7Wi1P 1
busier 1
Worlds 1
@iamnonexistent 1
@sweetbabecake 1
😞 3
hernia 3
@carliot23 1
seems 7
Et 1
live 26
crafting 1
Ken 1
#FreebieFriday 1
Wrong 1
@infinite7muse 1
Everything 3
@bowserrh 1
Damn 2
True 3
5am 5
logged 1
nightmare 4
persona 1
ticket 4
@Wufanited 1
package 1
normal 3
greymind 2
Delph 4
Jeebus 1
perfect 1
Dheena 1
@pinkle_dhesi 1
til 5
@LeanneHirst 1
AFTERNOON 1
@Gaze

2.5 1
http://t.co/lR3JUxPtJG 1
friend 14
al 2
dairy 1
Uff 1
saaaad 1
12 8
haestarr 1
#lgbt 3
kirkiri 1
sex 2
funk 1
hiding 1
#TURKEY 1
nooooo 1
@WLK_SNaeun 1
@ROOM94 1
@unclutching 1
7.30- 1
office 7
bomb 2
22nd 1
download 1
40 3
invalid 1
Casillas 1
somebody 1
Nemanja 1
PERF 1
Do 8
Math 1
Inc 2
http://t.co/l99IanEBsi 1
Jongdae 1
@gabriel_platon 1
hbd 1
organization 1
@myoddballs 1
testing 1
seniors 1
@AmeAmeSakura 1
Mady 1
easily 2
flight 6
´ 4
finale 2
@TheOGB 1
bloopers 2
complained 1
jahat 2
https://t.co/H8nYlHaQOo 1
Practice 1
TL 2
@BillieJoeSpouse 1
kick 1
WHERE 2
eunji 1
crime 1
faith 2
every 14
Eh 1
#interracial 3
suan 1
345 1
@S1dharthM 1
Laper 1
http://t.co/Lh5sNTHmIV 1
with 172
freed 1
probably 8
beasts 1
pathetic 1
richard 1
boss 4
hr 1
zzzz 1
joking 4
z 4
relatives 1
akana 1
#unloved 1
mass 1
realise 4
shame 13
@Muselshoux 1
#bblf 1
76 1
@jxhun 1
@JodanasandyXx 1
project 1
15-24 1
overall 1
PRICE 1
Rejection 1
@AnnieIsDoomed 1
badly 12
@ameliahartin 1
Handsome 1
Kylie 1
#F

FOLLOWED 102
Mubark 1
@enikotsz 1
manual 1
blocks 2
they've 3
THOMAS 1
Anna 2
ov 2
@salmasl_ 1
HOPING 1
@SilverkissesTV 1
Does 1
@marixyanchik1 1
Flat 1
honey 2
stud 1
#ugh 1
elsewhere 1
7 14
container 1
advisory 1
black 12
@Hydrojeon 1
perf 1
You'll 3
magpie 1
ACTUALLY 2
siannn 1
suggest 2
@imcherrycblls 1
:'D 1
picnic 1
Big 6
tweak 1
substitution 1
tOWNS 1
@alkapranos 1
imma 1
card 5
bae's 1
#loveofmylife 2
@HuaAng75 1
WEEK 2
@MrTomBaker 1
@sophielbradshaw 2
meningitis 1
BACKKK 1
AoS 1
#countrymusic 4
location 4
yesterday 13
regret 3
septum 1
near 9
assingnment 1
bananas 1
@MoreConsole 1
tzelumxoxo 1
possible 5
44 1
injurys 1
endlessly 1
tunnel 2
@cmoan3 1
@flytetymejam 1
@charleybilton 1
movie 15
DOM 1
quarter 1
Noooo 2
@AWSSupport 1
person 10
ate 9
@kiyomitsucashew 1
SNOB 1
partied 1
CARE 1
FELL 1
http://t.co/YbOKUQDWyE 1
E 5
ah 9
@banamnam 1
somehow 2
RAIN 2
losing 1
helppp 1
knackered 1
@Citadel_Hoju 1
ICU 1
https://t.co/lMAAJ9Kmvk 1
gim 1
beardy 1
Proposed 1
@japhantrash 1
SLP 1

@3nymph 1
cause 13
THEIR 2
lahhh 1
heard 6
Ruby 2
https://t.co/3dQB9Pt3UY 1
nowdays 1
@ParkSooyoungie 1
@FluffyBearsPS3 1
goys 1
EXO 1
@larakiara 1
fallen 3
faggot 1
shut 2
amiibo 1
@thatdavidmiller 1
slightly 2
apb 2
@natzaz17 1
@LeeUUHN 1
@iTaimikhan 1
AWAY 1
Ekta 1
impossible 4
@GustoPizzaDM 1
YouTube 2
baaack 1
victory 1
assholes 1
https://t.co/Pm8mxoGpEn 1
ANOTHER 1
greatful 1
swimming 1
@AgathaChelsea18 1
@OloapZurc 2
pj's 1
notes 1
paalam 1
yes 26
times 19
nai 1
nicely 1
Met 1
Yach 1
@AsdaServiceTeam 2
snsd 2
@HelpwDms 1
@osullivand 1
@sonicretro 2
text 8
committed 1
ANDROMEDA 1
Skulker 1
@bornsinqer 1
fried 1
diplomacy 1
challo 1
solo 5
@wtfxmbs 1
bay 1
ukiss 1
co 2
limit 2
HELL 1
jenners 1
odds 1
@Danica_Yu 1
@mermaid_bl00d 1
@b_lurryface 1
https://t.co/Ribf3SkrDI 1
Smi 1
rehearsal 1
soniii 1
@tbhrapmon 1
@joyce_gleek 1
ALREADY 2
ring 1
@rickygervais 2
https://t.co/cI5FPi66co 1
xenophobes 1
mañana 1
cb 1
@Walls 1
CHANNEL 1
hala 1
@hitgal_hashmi 1
yum 1
Neil 1
lots 3
WHEN 3
sco

didn 1
lonely 8
Ry 1
called 5
Center 1
toni 1
@eckoxsoldier 1
@kitteninlaces 1
hacharatt 1
@SarahLucero 1
SaSin 1
Darren 2
= 2
weekends 1
nalang 1
bud 1
@Nayritje 1
#beauty 1
Nakakapikon 1
website 6
Seeing 1
bittersweetness 1
but 384
Journey 1
tonnes 1
#isolated 1
usual 2
worried 3
answer 9
podcast 1
frequently 1
rash 1
Woaah 1
like 193
58543 1
gave 8
Miami 1
teeth 6
Nick 1
marathon 3
Nearly 1
http://t.co/zpLPgKesOH 1
@Glam_And_Gore 1
leads 1
green 6
Hull 1
#DomesticViolence 1
workouts 1
@dxniellacueto 1
#anywayhedidanicejob 1
on-board 1
@Its_Divine_D 1
soon 41
@malikm0ney 1
BAKIT 1
@ChaeHyungwon_ 1
@LeyejGilmour 1
Academics 1
selena 1
juries 1
@nba2kmobile 1
gallon 1
model 3
goals 2
WOKE 1
bach 1
Feel 8
@znclair 2
perpetually 1
GOAT 1
heading 2
@epiphanichood 1
probs 1
@Uber_RSA 2
@pixiesuga 1
@HeartYorkshire 1
rlyhurts 1
looked 5
snow 2
daw 2
@trcpqveen 1
@lemoncandykiss 1
fwm 1
AbbyMill 2
smol 1
We're 14
@tescomobile 2
Al 2
@itstishh 1
doo 2
hopes 2
uuughhhh 1
http://t.co/KWdxYb9dBC

@StevensSlateCo 1
pliss 1
@wendykims 1
@Rossi_mac9306 1
esteem 1
typos 1
respect 3
maganda 1
whatsapp 1
up.Come 1
seatmates 1
Ami 1
ding 3
#edm 1
@Cour98_x 1
nuggets 2
Chelsea 1
1d 1
@BloutAngelina 1
mirror 1
Found 1
http://t.co/b3qkLaFJx4 1
hitting 1
zoomed 1
@Ch4rm41n3 1
ramzan 1
dinner 4
que 1
breakups 2
Labour's 1
@ayoo_gerry 1
servers 1
Montana 1
@Uber_Chennai 1
@QuetaAuthor 1
playables 1
5:15 1
Councillors 1
@janellearg 1
@MassDeception1 2
seemed 1
@kewlaf 1
12.00 1
http://t.co/iWAeDmuooP 1
duck 1
@SP3CTACL3S 1
@GlastoFest 1
Finnair 1
@Rhianne_97 1
sholong 1
gotten 1
set 9
short 8
rib 1
yelaaaaaaa 1
price 1
@Morteraaaaa 1
💋 2
lure 1
shor 1
striker 2
within 3
Kian 1
Cory 1
gorgeous 1
@twinitisha 1
Rantie 1
Odoo 1
leh 1
Pyaari 1
rlly 2
http://t.co/wN3169K7Kb 1
whips 1
vips 1
@Julia886 1
streams 3
lootcrate 1
HES 1
https://t.co/Hgr1dMQ1eQ 1
Slovakia's 1
Cuddling 1
poor 25
@namwngr 1
it'll 3
deserved 1
Struggling 1
negotiate 2
LA 3
disc 1
curtain 1
models 1
cut 13
SORRRY 1
deactivate

In [58]:
print(len(howmanywords))

12545


In [69]:


allUsers = []

# each line of the file is one tweet encoded as a JSON object
with open(file_location, "r") as f:
    for line in f:
        js = json.loads(line.rstrip())
        user = js['user']['screen_name']
        allUsers.append(user)


In [72]:
len(allUsers)

5000
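Note that 5000 is the number of tweets, since allUsers receives one entry per line of the file, not the number of distinct users. A sketch of how distinct users and tweets per user could be counted, using made-up lines with the same 'user' / 'screen_name' structure as above (count_users and demo_lines are illustrative names):

```python
import json
from collections import Counter

def count_users(lines):
    """Count tweets per screen name, reading one JSON tweet per line."""
    users = Counter()
    for line in lines:
        line = line.strip()
        if line:
            users[json.loads(line)["user"]["screen_name"]] += 1
    return users

# Made-up lines, only for demonstration.
demo_lines = [
    '{"user": {"screen_name": "alice"}}',
    '{"user": {"screen_name": "bob"}}',
    '{"user": {"screen_name": "alice"}}',
]
users = count_users(demo_lines)
print(len(users))            # 2 distinct users
print(users.most_common(1))  # [('alice', 2)]
```

With the list already built in the notebook, len(set(allUsers)) would give the number of distinct users directly.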