## The goal of this project is to build a chatbot based on movie reviews, so that users can ask questions and hold a free-form conversation on this topic

- Dataset link: https://www.kaggle.com/Cornell-University/movie-dialog-corpus?select=movie_lines.tsv
- References:
  - https://shanebarker.com/blog/deep-learning-chatbot/
  - https://towardsdatascience.com/how-to-create-a-chatbot-with-python-deep-learning-in-less-than-an-hour-56a063bdfc44

In [45]:
import string
import nltk
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
import pandas as pd
import re
import numpy as np
import math
import random
from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to /home/douglas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Opening movie reviews

In [46]:
messages = pd.read_csv('./chatdata/movie_lines_normalized.tsv', header = None, delimiter="\t", quoting=3, encoding='ISO-8859-2')

In [47]:
messages.columns = ['msg_line', 'user_id', 'movie_id', 'msg']
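
The `quoting=3` argument passed to `pd.read_csv` above is the numeric value of `csv.QUOTE_NONE`, which tells the parser to treat quote characters as ordinary text. A minimal sketch of that effect, using only the standard-library `csv` module and an illustrative row (an assumption for the example, not a row read from the file):

```python
import csv
import io

# Illustrative tab-separated row whose last field starts with an unbalanced quote.
raw = 'L870\tu0\tm0\t"I\'m kidding.\n'

# QUOTE_NONE keeps the quote character as part of the field instead of
# treating it as the start of a quoted string.
reader = csv.reader(io.StringIO(raw), delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)

print(csv.QUOTE_NONE)  # the constant that quoting=3 refers to
print(rows[0])
```

This is probably why some `msg` values in the outputs still carry leading quote characters.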

In [48]:
messages.head(10)

Unnamed: 0,msg_line,user_id,movie_id,msg
0,L1045,u0,m0,They do not!
1,L1044,u2,m0,They do to!
2,L985,u0,m0,I hope so.
3,L984,u2,m0,She okay?
4,L925,u0,m0,Let's go.
5,L924,u2,m0,Wow
6,L872,u0,m0,Okay -- you're gonna need to learn how to lie.
7,L871,u2,m0,No
8,L870,u0,m0,"""""""I'm kidding. You know how sometimes you jus..."
9,L869,u0,m0,Like my fear of wearing pastels?


In [49]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304713 entries, 0 to 304712
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   msg_line  304713 non-null  object
 1   user_id   304713 non-null  object
 2   movie_id  304713 non-null  object
 3   msg       304713 non-null  object
dtypes: object(4)
memory usage: 9.3+ MB


In [50]:
messages.describe()

Unnamed: 0,msg_line,user_id,movie_id,msg
count,304713,304713,304713,304713
unique,304713,9035,659,265277
top,L316222,u4525,m289,What?
freq,1,537,1530,1679


### Cleaning the msg_line of the conversations

In [51]:
#remove every non-digit character
def remove_char(txt):
    return re.sub('[^0-9]','', txt)

In [52]:
#leaving just the number of the index, so L872 changes to 872
messages['msg_line_clean'] = [remove_char(msg) for msg in messages['msg_line']]

In [53]:
#change the column type to number
messages['msg_line_clean'] = pd.to_numeric(messages['msg_line_clean'])

In [54]:
messages = messages.sort_values(by=['msg_line_clean'])

In [55]:
#set the column as the index
messages = messages.set_index('msg_line_clean')

In [56]:
messages.head(10)

msg_line_clean,msg_line,user_id,movie_id,msg
49,L49,u0,m0,Did you change your hair?
50,L50,u3,m0,No.
51,L51,u0,m0,You might wanna think about it
59,L59,u9,m0,I missed you.
60,L60,u8,m0,It says here you exposed yourself to a group o...
61,L61,u9,m0,It was a bratwurst. I was eating lunch.
62,L62,u8,m0,With the teeth of your zipper?
63,L63,u7,m0,You the new guy?
64,L64,u2,m0,So they tell me...
65,L65,u7,m0,C'mon. I'm supposed to give you the tour.
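
The conversion to a numeric index matters because sorting the raw `msg_line` strings would be lexicographic. A small sketch in plain Python (the helper name `line_number` is hypothetical; the IDs are taken from the first `head()` output):

```python
import re

def line_number(line_id: str) -> int:
    """Strip every non-digit character and return the remaining number."""
    return int(re.sub(r'[^0-9]', '', line_id))

line_ids = ['L1045', 'L985', 'L49', 'L872']

# Sorting the raw strings compares character by character,
# so 'L1045' lands before 'L49'.
lexicographic = sorted(line_ids)

# Sorting by the extracted number restores the order of the lines.
numeric = sorted(line_ids, key=line_number)

print(lexicographic)
print(numeric)
```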


### Removing entities

In [57]:
entities = pd.read_csv('./chatdata/entity_list.tsv', header = None, delimiter="\t", quoting=3, encoding='ISO-8859-2')

In [58]:
entities.columns = ['ent', 'type']

In [59]:
entities.head()

Unnamed: 0,ent,type
0,Kinda,ORG
1,The Dallas Times Herald,ORG
2,Queen Louisa,PERSON
3,A.M,GPE
4,Cousin Hop,PERSON


In [60]:
entities['ent_len'] = [len(e) for e in entities['ent']]

In [61]:
s = entities['ent_len'].sort_values(ascending=False).index

In [62]:
entities = entities.reindex(s)

In [63]:
entities = entities.reset_index(drop=True)

In [64]:
entities.head()

Unnamed: 0,ent,type,ent_len
0,"""""""How can the Bolshevik cause gain respect am...",WORK_OF_ART,237
1,"""""""The Premier wishes to inform the Government...",WORK_OF_ART,192
2,""""""" Come Tuesday twelve a.m. bingo these like-...",WORK_OF_ART,182
3,"""""""The suggestion of the President regarding t...",WORK_OF_ART,155
4,"""""""The Management of Boyd's takes pleasure in ...",WORK_OF_ART,146


In [65]:
data = messages['msg']

In [66]:
ent_list =  ['PERSON', 'ORG', 'NORP', 'FAC', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART', 'LAW', 'LANGUAGE']
#ent_list =  ['LANGUAGE']

ent = list()
for i in range(len(entities.index)):
    if entities['type'][i] in ent_list:
        ent.append(entities['ent'][i])

In [67]:
ent = list(set(ent))
print(len(ent))
print(ent)

28514
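
The filtering, de-duplication and length ordering done in the cells above can be sketched on a handful of rows in plain Python. The first four pairs come from the `entities.head()` output; the `('twelve', 'CARDINAL')` pair and the reduced `kept_types` set are assumptions made for the example.

```python
# (entity, type) pairs; 'twelve'/'CARDINAL' is invented for the example.
entities = [
    ('Kinda', 'ORG'),
    ('The Dallas Times Herald', 'ORG'),
    ('Queen Louisa', 'PERSON'),
    ('A.M', 'GPE'),
    ('twelve', 'CARDINAL'),
    ('Queen Louisa', 'PERSON'),  # duplicate on purpose
]
kept_types = {'PERSON', 'ORG', 'GPE'}

# Keep only the wanted types; the set comprehension removes duplicates.
names = {name for name, ent_type in entities if ent_type in kept_types}

# Longest names first (ties broken alphabetically so the order is deterministic).
ordered = sorted(names, key=lambda name: (-len(name), name))

print(ordered)
```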


In [68]:
dict = {}
for n in ent:
    dict[n] = len(n)

In [69]:
#sort dict by biggest values
dict = {k: v for k, v in sorted(dict.items(), key=lambda item: item[1], reverse=True)}

In [70]:
dict

{'"""How can the Bolshevik cause gain respect among the Moslems if your three representatives Buljanoff Iranoff and Kopalski get so drunk that they throw a carpet out of their hotel window and complain to the management that it didn\'t fly"': 237,
 '"""The Premier wishes to inform the Government of the United States that it will be impossible for him to attend the meeting suggested by the President unless the meeting is held in Moscow."""': 192,
 '""" Come Tuesday twelve a.m. bingo these like-minded deviates log on and start yakking it up: explicit sex crime gossip who did what to whom who wants to do what when why and how."""': 182,
 '"""The suggestion of the President regarding the possibility of a meeting in Moscow would be unacceptable to Her Majesty\'s Government at the present time."': 155,
 '"""The Management of Boyd\'s takes pleasure in requesting the company of Mr. Richard Starkey that\'s you in their recently refinished gaming rooms."': 146,
 '"""Well Jim I says, it makes me 

In [71]:
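def remove_entity(corpus):
    #note: the text is split on spaces, so only single-token entities can match
    corpus = corpus.split(' ')
    #membership test on the dict itself; no need to rebuild the key list for every token
    corpus = [c for c in corpus if c not in dict]
    return ' '.join(corpus)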
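
Because `remove_entity` compares whitespace-separated tokens with the entity names, a multi-word entity such as `The Dallas Times Herald` cannot match. One possible alternative, sketched here under the assumption that plain substring removal is acceptable, is to build a single regular expression with the longest names first. The function name `remove_entities` and the sample sentence are hypothetical.

```python
import re

def remove_entities(text, entity_names):
    # Longest names first, so a long entity is removed before any shorter
    # name that happens to be contained in it.
    ordered = sorted(set(entity_names), key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(name) for name in ordered))
    cleaned = pattern.sub('', text)
    # Collapse the spaces left behind by the removal.
    return re.sub(r' +', ' ', cleaned).strip()

sample = remove_entities(
    'I read it in The Dallas Times Herald yesterday',
    ['Dallas', 'The Dallas Times Herald'],
)
print(sample)
```

Substring matching can also hit short names inside longer words, so adding word boundaries may be worth considering before applying this to the full corpus.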

In [72]:
#checkpoint
messages.to_csv('./chatdata/movie_lines_pre_processed.tsv', index=False, sep='\t', header=True)

### Opening conversation sequence

In [73]:
#read the file with the conversation sequence
conv_seq = pd.read_csv('./chatdata/movie_conversations.tsv', header = None, delimiter="\t", quoting=3, encoding='ISO-8859-2')

In [74]:
conv_seq.columns = ['user1_id', 'user2_id', 'movie_id', 'sequence']

In [75]:
conv_seq.head(10)

Unnamed: 0,user1_id,user2_id,movie_id,sequence
0,u0,u2,m0,['L194' 'L195' 'L196' 'L197']
1,u0,u2,m0,['L198' 'L199']
2,u0,u2,m0,['L200' 'L201' 'L202' 'L203']
3,u0,u2,m0,['L204' 'L205' 'L206']
4,u0,u2,m0,['L207' 'L208']
5,u0,u2,m0,['L271' 'L272' 'L273' 'L274' 'L275']
6,u0,u2,m0,['L276' 'L277']
7,u0,u2,m0,['L280' 'L281']
8,u0,u2,m0,['L363' 'L364']
9,u0,u2,m0,['L365' 'L366']


In [76]:
conv_seq.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83097 entries, 0 to 83096
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user1_id  83097 non-null  object
 1   user2_id  83097 non-null  object
 2   movie_id  83097 non-null  object
 3   sequence  83097 non-null  object
dtypes: object(4)
memory usage: 2.5+ MB


In [77]:
conv_seq.describe()

Unnamed: 0,user1_id,user2_id,movie_id,sequence
count,83097,83097,83097,83097
unique,5420,5608,617,83097
top,u4331,u1475,m289,['L657815' 'L657816']
freq,193,187,338,1


### Build conversation sequence

In [78]:
def split_conversation(txt):
    txt_alt = txt.split(' ')
    return txt_alt

In [79]:
def seq_to_list(seq):
    seq_list = [remove_char(s) for s in seq]
    return seq_list
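
Each `sequence` value is a string such as `['L194' 'L195' 'L196' 'L197']`, so splitting on spaces and stripping non-digits yields the line numbers. A compact sketch of the same idea with a single regular expression (the helper name `parse_sequence` is hypothetical):

```python
import re

def parse_sequence(raw: str) -> list:
    """Return the numeric line IDs found in a raw sequence string."""
    return [int(number) for number in re.findall(r'L(\d+)', raw)]

print(parse_sequence("['L194' 'L195' 'L196' 'L197']"))
```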

In [80]:
#initializing the msg_2 column
messages['msg_2'] = '-'

In [81]:
def link_conversations(seq_list, df, filter1, filter2):
    #for each message except the last, store the following message as its answer
    for i in range(len(seq_list) - 1):
        next_msg = df.loc[int(seq_list[i+1]), filter1]
        df.at[int(seq_list[i]), filter2] = next_msg

In [82]:
#link each message with its answer
for conv in conv_seq['sequence']:
    #split each sequence by space
    seq = split_conversation(conv)

    #remove the char L from the sequences
    txt_alt = [remove_char(s) for s in seq]

    #use the conversation sequence to build the target answer for each message
    link_conversations(txt_alt, messages, 'msg', 'msg_2')

In [83]:
messages.head(30)

msg_line_clean,msg_line,user_id,movie_id,msg,msg_2
49,L49,u0,m0,Did you change your hair?,No.
50,L50,u3,m0,No.,You might wanna think about it
51,L51,u0,m0,You might wanna think about it,-
59,L59,u9,m0,I missed you.,It says here you exposed yourself to a group o...
60,L60,u8,m0,It says here you exposed yourself to a group o...,It was a bratwurst. I was eating lunch.
61,L61,u9,m0,It was a bratwurst. I was eating lunch.,With the teeth of your zipper?
62,L62,u8,m0,With the teeth of your zipper?,-
63,L63,u7,m0,You the new guy?,So they tell me...
64,L64,u2,m0,So they tell me...,C'mon. I'm supposed to give you the tour.
65,L65,u7,m0,C'mon. I'm supposed to give you the tour.,-
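
The linking step pairs every message with the next one in its conversation and leaves `'-'` on the last message. A minimal sketch with plain dictionaries instead of a DataFrame; the messages are taken from the output above, while the sequence `[49, 50, 51]` is assumed for the example.

```python
messages = {
    49: 'Did you change your hair?',
    50: 'No.',
    51: 'You might wanna think about it',
}
sequences = [[49, 50, 51]]  # assumed conversation order

# Every message starts without an answer.
replies = {line_id: '-' for line_id in messages}

for seq in sequences:
    # zip(seq, seq[1:]) yields (current, next) pairs and stops before the last ID.
    for current_id, next_id in zip(seq, seq[1:]):
        replies[current_id] = messages[next_id]

print(replies)
```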


## Pre-processing the msg

In [84]:
data = messages['msg']

In [85]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [86]:
lemmatizer = WordNetLemmatizer()
#build the stopword set once instead of on every call (requires nltk.download('stopwords'))
stopwords_ = set(stopwords.words("english"))

def pre_processing_text(corpus):
    #remove html tags
    corpus = re.sub(r'<.*?>', '', str(corpus))
    
    #remove every character that is not a letter, a digit or whitespace
    corpus = re.sub(r'[^a-zA-Z0-9\s]', '', corpus)
    
    #remove duplicated spaces
    corpus = re.sub(r' +', ' ', corpus)
    
    #lowercasing
    corpus = corpus.lower()
    
    #tokenization
    corpus = re.findall(r"\w+(?:'\w+)?|[^\w\s]", corpus)
    
    #lemmatization
    corpus = [lemmatizer.lemmatize(c, get_wordnet_pos(c)) for c in corpus]
    
    #remove punctuation
    corpus = [t for t in corpus if t not in string.punctuation]
    
    #remove stopwords
    #note: it makes the model worse
    corpus = [t for t in corpus if t not in stopwords_]
    
    corpus = ' '.join(corpus)

    return corpus
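
To see what the regex and stopword steps do to a single line, here is a reduced sketch that uses only the standard library. It leaves out lemmatization and replaces NLTK's English stopword list with a tiny hypothetical set, so its output is close to, but not the same as, what `pre_processing_text` returns.

```python
import re

# Tiny hypothetical stopword list; the notebook uses NLTK's English list instead.
STOPWORDS = {'to', 'you', 'the', 'no'}

def simple_pre_process(text):
    text = re.sub(r'<.*?>', '', text)            # drop HTML tags
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)   # keep letters, digits, whitespace
    tokens = text.lower().split()                # lowercase and split on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]
    return ' '.join(tokens)

print(simple_pre_process("C'mon. I'm supposed to give you the tour."))
print(repr(simple_pre_process('No.')))
```

The second call shows how a message made only of stopwords collapses to an empty string, which is where the `''` entries in the pre-processed data come from.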

In [87]:
%%time
data_pre_processed = [pre_processing_text(str(m)) for m in data]
data_pre_processed

CPU times: user 8min 59s, sys: 3min 42s, total: 12min 42s
Wall time: 12min 45s


['change hair',
 '',
 'might wanna think',
 'miss',
 'say expose group freshman girl',
 'bratwurst eat lunch',
 'teeth zipper',
 'new guy',
 'tell',
 'cmon im suppose give tour',
 'dakota',
 'north actually howd',
 'kid people actually live',
 'yeah couple outnumber cow though',
 'many people old school',
 'thirtytwo',
 'get',
 'many people go',
 'couple thousand evil',
 'im use',
 'yeah guy never see horse jack clint eastwood',
 'girl',
 'burn pine perish',
 '',
 'bianca stratford sophomore dont even think',
 '',
 'could start haircut doesnt matter shes allow date old sister thats impossibility',
 'katarina stratford youve terrorize blaise',
 'express opinion terrorist action',
 'well yes compare choice expression year today event quite mild way bobby rictors gonad retrieval operation go quite well case youre interested',
 'still maintain kick ball merely spectator',
 'point kat people perceive somewhat',
 'tempestuous',
 'believe heinous bitch term use often',
 '',
 'patrick verona r

In [88]:
messages['msg_pre_processed'] = data_pre_processed

### Checking for duplicated messages in msg

In [89]:
data = messages['msg_pre_processed']

In [90]:
dict = {}
for n in data:
    if n in dict:
        dict[n] = dict[n] + 1
    else:
        dict[n] = 1

In [91]:
#sort dict by biggest values
dict = {k: v for k, v in sorted(dict.items(), key=lambda item: item[1], reverse=True)}
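
The counting and sorting done by hand above can also be expressed with `collections.Counter` from the standard library. A small sketch on made-up data (the list `processed` is an assumption for the example):

```python
from collections import Counter

processed = ['yes', '', 'yeah', 'yes', '', '']

counts = Counter(processed)

# most_common() returns (message, count) pairs, highest count first.
ranked = counts.most_common()

# Messages that occur more than once, as in the d_list built in this section.
repeated = [msg for msg, n in ranked if n > 1]

print(ranked)
print(repeated)
```

Using `Counter` would also avoid naming a variable `dict`, which shadows the built-in type for the rest of the notebook.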

In [92]:
dict

{'': 12843,
 'yes': 2121,
 'yeah': 1552,
 'know': 876,
 'go': 692,
 'okay': 653,
 'right': 629,
 'dont know': 629,
 'sure': 554,
 'oh': 535,
 'say': 514,
 'get': 508,
 'whats': 467,
 'mean': 462,
 'well': 441,
 'come': 411,
 'think': 406,
 'want': 393,
 'thank': 379,
 'talk': 351,
 'really': 347,
 'dont': 325,
 'like': 319,
 'hello': 315,
 'unknown': 310,
 'thanks': 305,
 'happen': 291,
 'yes sir': 289,
 'course': 277,
 'see': 276,
 'good': 263,
 'nothing': 263,
 'tell': 256,
 'huh': 253,
 'hi': 252,
 'fuck': 249,
 'im sorry': 247,
 'sir': 241,
 'cant': 237,
 'hell': 220,
 'look': 214,
 'thats right': 202,
 'uhhuh': 195,
 'love': 191,
 'hey': 187,
 'im': 177,
 'whats wrong': 173,
 'thats': 165,
 'fine': 162,
 'one': 162,
 'way': 156,
 'sorry': 156,
 'time': 151,
 'shit': 147,
 'excuse': 145,
 'shut': 145,
 'guess': 140,
 'oh god': 139,
 'please': 138,
 'dont think': 136,
 'didnt': 135,
 'stop': 135,
 'maybe': 130,
 'oh yeah': 127,
 'told': 119,
 'wait': 117,
 'ok': 116,
 'help': 115,
 

In [93]:
#example of duplicated msg
messages[messages['msg_pre_processed'] == 'yes']

msg_line_clean,msg_line,user_id,movie_id,msg,msg_2,msg_pre_processed


In [94]:
#get the repeated messages
d_list = list()
for k in dict:
    if dict[k] > 1:
        d_list.append(k)

In [95]:
d_list

['',
 'yes',
 'yeah',
 'know',
 'go',
 'okay',
 'right',
 'dont know',
 'sure',
 'oh',
 'say',
 'get',
 'whats',
 'mean',
 'well',
 'come',
 'think',
 'want',
 'thank',
 'talk',
 'really',
 'dont',
 'like',
 'hello',
 'unknown',
 'thanks',
 'happen',
 'yes sir',
 'course',
 'see',
 'good',
 'nothing',
 'tell',
 'huh',
 'hi',
 'fuck',
 'im sorry',
 'sir',
 'cant',
 'hell',
 'look',
 'thats right',
 'uhhuh',
 'love',
 'hey',
 'im',
 'whats wrong',
 'thats',
 'fine',
 'one',
 'way',
 'sorry',
 'time',
 'shit',
 'excuse',
 'shut',
 'guess',
 'oh god',
 'please',
 'dont think',
 'didnt',
 'stop',
 'maybe',
 'oh yeah',
 'told',
 'wait',
 'ok',
 'help',
 'work',
 'dont understand',
 'much',
 'jesus',
 'whats matter',
 'would',
 'let go',
 'long',
 'believe',
 'alright',
 'great',
 'kill',
 'dad',
 'dont believe',
 'youre',
 'call',
 'take',
 'whats go',
 'ask',
 'bad',
 'could',
 'leave',
 'never',
 'name',
 'yet',
 'wont',
 'understand',
 'exactly',
 'forget',
 'im sure',
 'whats name',
 'ki

In [96]:
messages = messages.drop_duplicates(subset=['msg_pre_processed'])
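
By default `drop_duplicates` keeps the first row for each distinct value and discards the later ones. A plain-Python sketch of that keep-first behaviour; row 817 comes from the output below, while rows 900 and 901 are invented for the example.

```python
# (line number, original message, pre-processed message)
rows = [
    (817, 'Yes', 'yes'),
    (900, 'Yes.', 'yes'),    # hypothetical later duplicate
    (901, 'Yeah', 'yeah'),   # hypothetical
]

seen = set()
kept = []
for line_no, msg, processed in rows:
    # Only the first row with a given pre-processed text survives.
    if processed not in seen:
        seen.add(processed)
        kept.append((line_no, msg, processed))

print(kept)
```

A consequence worth keeping in mind is that the answers (`msg_2`) attached to the discarded rows are lost as well.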

In [120]:
#example of duplicated msg
messages[messages['msg_pre_processed'] == 'yes']

msg_line_clean,msg_line,user_id,movie_id,msg,msg_2,msg_pre_processed,target
817,L817,u9,m0,Yes,I don't like to do what people expect. Then th...,yes,0


In [98]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 229108 entries, 49 to 666576
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   msg_line           229108 non-null  object
 1   user_id            229108 non-null  object
 2   movie_id           229108 non-null  object
 3   msg                229108 non-null  object
 4   msg_2              229108 non-null  object
 5   msg_pre_processed  229108 non-null  object
dtypes: object(6)
memory usage: 12.2+ MB


In [99]:
messages.describe()

Unnamed: 0,msg_line,user_id,movie_id,msg,msg_2,msg_pre_processed
count,229108,229108,229108,229108,229108,229108
unique,229108,8980,617,229108,145702,229108
top,L316222,u3681,m289,If anything ever happens to me...,-,onei take care grow tough neighbourhood ive ha...
freq,1,415,1126,1,64196,1


## Removing empty messages

In [130]:
messages[messages['msg_pre_processed'] == '']

msg_line_clean,msg_line,user_id,movie_id,msg,msg_2,msg_pre_processed,target


In [131]:
messages = messages.drop(messages[messages['msg_pre_processed'] == ''].index)

### Removing NaN msg originating from '' messages

In [123]:
#return generic answer
def generic_answer(txt):
    asw_list = ['talk more about it',
                'can you explain it better?',
                'I need to think more about it',
                'maybe...'
                ]
    if txt == '-' or txt == '':
        return random.choice(asw_list)
    return txt

In [None]:
messages[messages['msg_pre_processed'].isna()]

In [None]:
#filling the NaN messages with a string (not necessary)
messages = messages.fillna(generic_answer('-'))
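
In the cell above, `generic_answer('-')` is evaluated once, before `fillna` runs, so every missing cell would receive the same sentence. A plain-Python sketch of the difference between one shared draw and one draw per row; the `rows` list and the variable names are assumptions for the example.

```python
import random

ANSWERS = ['talk more about it',
           'can you explain it better?',
           'I need to think more about it',
           'maybe...']

def generic_answer(txt):
    # Return a random fallback for '-' or '', otherwise the text itself.
    if txt == '-' or txt == '':
        return random.choice(ANSWERS)
    return txt

rows = ['No.', None, None, None]  # None stands in for NaN

# One draw shared by every missing value (what a scalar fill does).
shared = generic_answer('-')
filled_once = [shared if r is None else r for r in rows]

# A fresh draw for each missing value.
filled_per_row = [generic_answer('-') if r is None else r for r in rows]

print(filled_once)
print(filled_per_row)
```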

### Removing apostrophes (needed for the embedding and PageRank dictionary)

In [117]:
messages['msg_pre_processed'] = [ word.replace("\'","") for word in messages['msg_pre_processed'] ]

### Filling '-' or '' messages with a generic one

In [118]:
#setting a generic answer for the messages without an answer
messages['msg_2'] = [generic_answer(msg) for msg in messages['msg_2']]

In [127]:
messages.head(30)

msg_line_clean,msg_line,user_id,movie_id,msg,msg_2,msg_pre_processed,target
49,L49,u0,m0,Did you change your hair?,change hair,change hair,1
50,L50,u3,m0,No.,can you explain it better?,talk more about it,0
51,L51,u0,m0,You might wanna think about it,might wanna think,might wanna think,0
59,L59,u9,m0,I missed you.,miss,miss,0
60,L60,u8,m0,It says here you exposed yourself to a group o...,say expose group freshman girl,say expose group freshman girl,0
61,L61,u9,m0,It was a bratwurst. I was eating lunch.,bratwurst eat lunch,bratwurst eat lunch,0
62,L62,u8,m0,With the teeth of your zipper?,teeth zipper,teeth zipper,1
63,L63,u7,m0,You the new guy?,new guy,new guy,1
64,L64,u2,m0,So they tell me...,tell,tell,0
65,L65,u7,m0,C'mon. I'm supposed to give you the tour.,cmon im suppose give tour,cmon im suppose give tour,0


### Tagging the msg with classes

In [106]:
def define_target(corpus):
    
    if '?' in corpus:
        return 1
    else:
        return 0

In [107]:
data = messages['msg']

In [108]:
messages['target'] = [define_target(m) for m in data]

In [109]:
messages['target'] = messages['target'].astype(int)
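
The target is a simple question flag: 1 when the original message contains a question mark, 0 otherwise. A short sketch of the same rule written as a single expression (the name `is_question` is hypothetical; the sample messages come from the outputs above):

```python
def is_question(text: str) -> int:
    # bool converts to 1 or 0, matching the values define_target returns.
    return int('?' in text)

samples = ['Did you change your hair?', 'No.', 'You the new guy?']
targets = [is_question(s) for s in samples]
print(targets)
```

The flag is computed from `msg`, not `msg_pre_processed`, since the pre-processing removes the question marks.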

In [110]:
messages.head(20)

msg_line_clean,msg_line,user_id,movie_id,msg,msg_2,msg_pre_processed,target
49,L49,u0,m0,Did you change your hair?,No.,change hair,1
50,L50,u3,m0,No.,You might wanna think about it,,0
51,L51,u0,m0,You might wanna think about it,can you explain it better?,might wanna think,0
59,L59,u9,m0,I missed you.,It says here you exposed yourself to a group o...,miss,0
60,L60,u8,m0,It says here you exposed yourself to a group o...,It was a bratwurst. I was eating lunch.,say expose group freshman girl,0
61,L61,u9,m0,It was a bratwurst. I was eating lunch.,With the teeth of your zipper?,bratwurst eat lunch,0
62,L62,u8,m0,With the teeth of your zipper?,can you explain it better?,teeth zipper,1
63,L63,u7,m0,You the new guy?,So they tell me...,new guy,1
64,L64,u2,m0,So they tell me...,C'mon. I'm supposed to give you the tour.,tell,0
65,L65,u7,m0,C'mon. I'm supposed to give you the tour.,talk more about it,cmon im suppose give tour,0


### Save data

In [111]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 229108 entries, 49 to 666576
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   msg_line           229108 non-null  object
 1   user_id            229108 non-null  object
 2   movie_id           229108 non-null  object
 3   msg                229108 non-null  object
 4   msg_2              229108 non-null  object
 5   msg_pre_processed  229108 non-null  object
 6   target             229108 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 14.0+ MB


In [112]:
messages

msg_line_clean,msg_line,user_id,movie_id,msg,msg_2,msg_pre_processed,target
49,L49,u0,m0,Did you change your hair?,No.,change hair,1
50,L50,u3,m0,No.,You might wanna think about it,,0
51,L51,u0,m0,You might wanna think about it,can you explain it better?,might wanna think,0
59,L59,u9,m0,I missed you.,It says here you exposed yourself to a group o...,miss,0
60,L60,u8,m0,It says here you exposed yourself to a group o...,It was a bratwurst. I was eating lunch.,say expose group freshman girl,0
...,...,...,...,...,...,...,...
666522,L666522,u9034,m616,So far only their scouts. But we have had repo...,talk more about it,far scout report small impi farther north,0
666546,L666546,u9027,m616,Splendid site Crealock splendil I want to esta...,Certainly Sin,splendid site crealock splendil want establish...,0
666547,L666547,u9029,m616,Certainly Sin,talk more about it,certainly sin,0
666575,L666575,u9028,m616,Choose your targets men. That's right Watch th...,Keep steady. You're the best shots of the Twen...,choose target men thats right watch marker 55,0


In [113]:
messages.to_csv('./chatdata/movie_lines_pre_processed.tsv', index=False, sep='\t', header=True)

In [114]:
messages[0:3000].to_csv('./chatdata/movie_lines_pre_processed_for_test.tsv', index=False, sep='\t', header=True)
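
Since the files are written as tab-separated text, a quick round trip is a cheap way to check what a later notebook will read back. A sketch with the standard-library `csv` module and an in-memory buffer instead of a file on disk; the two rows are a reduced, illustrative subset of the columns.

```python
import csv
import io

rows = [
    {'msg_line': 'L49', 'msg': 'Did you change your hair?', 'target': 1},
    {'msg_line': 'L51', 'msg': 'You might wanna think about it', 'target': 0},
]

# Write the rows as tab-separated text with a header line.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['msg_line', 'msg', 'target'],
                        delimiter='\t', lineterminator='\n')
writer.writeheader()
writer.writerows(rows)

# Read them back; every value returns as a string.
buffer.seek(0)
restored = list(csv.DictReader(buffer, delimiter='\t'))
print(restored[0])
```

Note that with `index=False` the numeric `msg_line_clean` index is not written, so it has to be rebuilt from `msg_line` after loading if it is needed again.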