<h2> Chat Conversations Dataset</h2>
<p>Dataset exported from WhatsApp; the conversation spans 18 months. <br> </p>
<p> Traditional data processing techniques are applied to the dataset, such as: </p>

<ul>
<li> Lowercase all words </li>
<li> Remove punctuation </li>
<li> Remove stop words </li>
</ul>



<h3> Dependencies</h3>

In [1]:
import pandas as pd
import pprint
from collections import defaultdict
from future.utils import iteritems
import string
import json
import csv

<h3> Dataset </h3>
<br>
<i>Note:</i> <p> The dataset exported directly from WhatsApp is in <i>.txt</i> format; MS Excel was used to convert it to CSV via the <i>Get External Data</i> option.</p>
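
<p> The same conversion could also be done in Python. The sketch below is not the method used for this notebook; it assumes each exported line looks like <i>[DD/MM/YYYY, HH:MM:SS] Sender: text</i>, that the file is UTF-8, and that lines without a timestamp continue the previous message. The names <em>parse_export</em> and <em>export_to_csv</em> are hypothetical. </p>

```python
import csv
import re

# Assumed line layout: "[DD/MM/YYYY, HH:MM:SS] Sender: text"
LINE_PATTERN = re.compile(r"^(\[\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2}\]) (.*)$")


def parse_export(lines):
    """Turn raw export lines into [timestamp, message] rows."""
    rows = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_PATTERN.match(line)
        if match:
            rows.append([match.group(1), match.group(2)])
        elif rows:
            # No timestamp: treat the line as a continuation
            # of the previous (multi-line) message.
            rows[-1][1] += " " + line
    return rows


def export_to_csv(txt_path, csv_path):
    """Hypothetical helper: read the .txt export and write a two-column CSV."""
    with open(txt_path, encoding="utf-8") as source:
        rows = parse_export(source)
    with open(csv_path, "w", newline="", encoding="utf-8") as target:
        writer = csv.writer(target)
        writer.writerow(["timestamp", "message"])
        writer.writerows(rows)


# Made-up sample lines, only to illustrate the assumed layout
sample = [
    "[30/07/2018, 10:54:40] William: So its William here\n",
    "[30/07/2018, 11:03:59] William: Now I am\n",
    "spamming\n",
]
rows = parse_export(sample)
```

<p> The message column keeps the sender prefix, which is what filtering the <em>message</em> column by sender name would rely on. </p>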

In [2]:
chat = "data_conv.csv"
chat_df = pd.read_csv(chat, encoding = "ISO-8859-1")

<h3> Data processing pipeline </h3>
<br>
<p> Functions defined to run on the dataset. The order of execution matters when calling them: for example, <em>remover</em> expects a list of tokens, so it must run after <em>tokenizer</em>.</p>
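
<p> To see why the order matters, the steps can be traced on a single made-up message using only the standard library. The raw layout <i>Sender: text</i> is an assumption, and the snippet mirrors the steps rather than calling the notebook functions. </p>

```python
import string

raw = "William: So, it's William here!"  # made-up message, assumed layout

# 1. Lowercase first, so the later token comparison is case-insensitive
lowered = raw.lower()

# 2. Strip punctuation while the message is still a single string
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))

# A plain string has no remove() method, so token removal cannot happen yet
assert not hasattr(no_punct, "remove")

# 3. Tokenize by single spaces
tokens = no_punct.split(" ")

# 4. Only now can one occurrence of the sender name be removed
tokens.remove("william")

print(tokens)  # ['so', 'its', 'william', 'here']
```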

In [3]:
def drop_nan(df):
    """Gets a DataFrame and drops all NaNs

    Parameters
    ----------
    df : dataframe
        Dataframe containing the whatsapp exported conversation
    
    Returns
    -------
    DataFrame
        df without NaN in its rows
    """
    
    df = df.dropna()
    return df


def to_list(sentence):
    """Gets and converts to list the content of a dataframe cell

    Parameters
    ----------
    sentence : str
        sentence of each line of whatsapp conversation
    
    Returns
    -------
    list
        a list with all its elements inside casted as string
    """
    
    return [str(sentence)]


def filtering(df, column, value):
    """Gets a DataFrame and filters its contents based on a specific value in the selected column

    Parameters
    ----------
    df : dataframe
        Dataframe containing the whatsapp exported conversation
    
    column : str
        Column name of the dataframe where the changes should take effect
        
    value : str
        Value intended to be kept from specific column
    
    Returns
    -------
    DataFrame
        df filtered and displaying only the rows which contain the value passed to the function
    """
    
    df = df[df[column].str.contains(value)]
    return df


def remover(sentence):
    """Gets the tokenized content of a cell in the dataframe and removes unwanted tokens

    Parameters
    ----------
    sentence : list
        Tokens of the sentence or text within the cell in the dataframe

    Returns
    -------
    sentence: list
        tokens without the hand-picked words chosen to be removed
    """
    # Remove the first occurrence of each hand-picked token independently,
    # so that a missing token does not skip the remaining removals.
    for unwanted in ('william', 'äóimage', 'äógif'):
        try:
            sentence.remove(unwanted)
        except ValueError:
            pass

    # Drop long URL-like tokens by building a new list; removing items
    # while iterating over the same list would skip elements.
    return [word for word in sentence
            if not (len(word) >= 20 and word.startswith('http'))]


def lowercase(sentence):
    
    """Gets the content of a cell in the dataframe and lowercases it.
    Returns a copy of the string in which all case-based characters have been lowercased

    Parameters
    ----------
    sentence : str
        sentence within a df cell
    
    
    Returns
    -------
    sentence: str
        whole sentence in lowercase

    """   
    sentence = sentence.lower()
    return sentence


def tokenizer(sentence):
    """Gets the content of a cell in the dataframe and tokenizes it by white space

    Parameters
    ----------
    sentence : list
        Single-element list whose first item (sentence[0]) is the sentence or text within the cell
    
    
    Returns
    -------
    tokens: list
        list with word tokens, split by white spaces
    """
    tokens = sentence[0].split(" ")
    return tokens

def remove_punctuation(sentence):
    """Gets the content of a cell in the dataframe and removes punctuation signs
    This function uses str.translate with a table built by str.maketrans(from, to, delete)

    Parameters
    ----------
    sentence : str
        Sentence or text within the cell in the dataframe
    
    
    Returns
    -------
    sentence: str
        sentence without punctuation 
    """
    
    punct = string.punctuation
    return sentence.translate(str.maketrans("", "", punct))


def corpus_builder(df):
    
    """ Gets the whole dataset and makes one single corpus
     out of every sentence.
    

    Parameters
    ----------
    df : DataFrame
        Complete dataframe containing all the sentences. The dataframe
        must contain a 'message' column holding the token lists
    
    
    Returns
    -------
    corpus: list
        list containing all the words as tokens for the whole dataset
        
    """ 
    df.reset_index(inplace=True)
    corpus = []
    size = len(df)
    for index in range(0, size):
        corpus.extend(df['message'][index])
    
    return corpus


def word_ocurrences(corpora):
    """ Gets the whole dataset as a list of tokens and returns the frequency of each word

    Parameters
    ----------
    corpora : list
        All tokens that make up the dataset

    Returns
    -------
    max_result: tuple
        Tuple containing the most written word (word, quantity)

    words_ocurrences: dict
        dictionary with Key = Word & Value = Number of times the word is in the
        dataset.
    """
    words_ocurrences = defaultdict(int)
    for word in corpora:
        words_ocurrences[word] +=1
    # dict.items() works directly in Python 3, so the iteritems helper is not needed here
    max_result = max(words_ocurrences.items(), key=lambda x: x[1])
    return max_result, dict(words_ocurrences)


def rank_words(word_count):
    """ Gets the unordered dictionary of words and their number of occurrences
        and returns it ordered by value

    Parameters
    ----------
    word_count : dict
        dictionary mapping each word to the number of times it is
        present in the dataset (WORD: #Times_in_dataset)

    Returns
    -------
    ranked_words: dict
        Dictionary with the word occurrences sorted by value, descending
    """
    ranked_words = dict(sorted(word_count.items(), key=lambda item: item[1], reverse=True))
    return ranked_words

def distinct_words(corpus):
    """Returns the list of unique words in the corpus and the vocabulary size"""
    unique_words = list(set(corpus))
    size_vocab = len(unique_words)
    return unique_words, size_vocab
    


<h3> Data processing pipeline Applied </h3>
<img src="image-pipe.png">

In [4]:
# Drop NaN
chat_df = drop_nan(chat_df)

# Select only Messages from William

chat_df = filtering(chat_df, 'message', 'William')

# Lower Case the strings
chat_df['message'] = chat_df['message'].apply(lowercase)


# Remove punctuation from every sentence contained in the message column
chat_df['message'] = chat_df['message'].apply(remove_punctuation)


# Convert each cell of the DF in a list made of strings
chat_df['message'] = chat_df['message'].apply(to_list)


# Tokenize the list 
chat_df['message'] = chat_df['message'].apply(tokenizer)


# remover runs last: it needs each message as a list of tokens.
# Earlier in the pipeline each message was still a string, so list.remove failed
chat_df['message'] = chat_df['message'].apply(remover)


corpus = corpus_builder(chat_df)

top, ocurrence = word_ocurrences(corpus)
ranking = rank_words(ocurrence)


chat_df.head(5)

Unnamed: 0,index,timestamp,message
0,0,"[30/07/2018, 10:54:40]","[so, its, william, here]"
1,2,"äó_[30/07/2018, 11:03:","[53, omitted]"
2,3,"[30/07/2018, 11:03:59]","[now, i, am, spamming]"
3,5,"[30/07/2018, 12:50:55]","[i, came, to, have, lunch, with, the, guys]"
4,6,"[30/07/2018, 12:50:59]","[and, i, am, dizzy]"


<h3> Corpus </h3>
<br>
<p> The <em>corpus</em> variable stores the tokens in their original order, so it can be used to analyze context and predict words. The <em>ranking</em> variable, on the other hand, holds the words ordered by number of occurrences; it loses all sequential meaning and is better suited to simple frequency analysis. </p>
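
<p> As an aside, the same ranking could be obtained with <em>collections.Counter</em> from the standard library. This is an alternative sketch, not what the notebook runs, and the sample tokens are made up. </p>

```python
from collections import Counter

# Made-up tokens standing in for the real corpus
sample_corpus = ["i", "was", "cold", "i", "think", "i", "was", "hungry"]

# Count every token in a single pass
counts = Counter(sample_corpus)

# most_common() returns (word, count) pairs sorted by count, descending
ranked = counts.most_common()

# The first pair is the most frequent word
top_word = ranked[0]
print(top_word)  # ('i', 3)
```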

In [5]:
pprint.pprint(corpus[200:400], compact=True, width=100)

['what', 'was', 'that', 'thing', 'that', 'sonja', 'said', 'about', 'chalenge', 'in', 'chatter',
 'if', 'it', 'was', 'in', 'the', 'last', 'meeting', 'i', 'was', 'in', 'auto', 'pilot', 'and',
 'cold', 'i', 'think', 'yeah', 'i', 'can', 'recall', 'that', 'saying', 'that', 'i', 'was', 'really',
 'hungry', 'and', 'cold', 'i', 'think', 'i', 'just', 'remember', 'one', 'thing', 'and', 'it', 'is',
 'you', 'got', 'a', 'diploma', 'and', 'there', 'was', 'a', 'white', 'dog', 'in', 'someone', 'wall',
 'paper', 'hahahaha', 'she', 'said', 'that', 'wow', 'i', 'was', 'really', 'gone', 'sometimes',
 'day', 'after', 'weed', 'i', 'am', 'still', 'quite', 'slow', 'hahahahaha', 'nice', 'way', 'to',
 'see', 'it', 'i', 'like', 'dogs', 'and', 'cats', 'may', 'be', 'thats', 'why', 'it', 'was', 'the',
 'coffee', 'talk', 'i', 'am', 'so', 'looking', 'forward', 'what', 'will', 'happen', 'i', 'want',
 'to', 'see', 'it', 'will', 'give', 'me', 'a', 'new', 'story', 'to', 'tell', 'when', 'having', 'a',
 'beer', '', 'so', 'i

<h3> Dataset ordered by Value and saved as JSON </h3>

In [6]:
with open('sorted_vocabulary_cleaned.json', 'w') as sorted_vocab:
    json.dump(ranking, sorted_vocab, indent=4)

<h3> Unique words in dataset </h3>

In [7]:
dist_words, length = distinct_words(corpus)
with open('unique_words_cleaned.csv', 'w', newline='') as unique_words:
    wr = csv.writer(unique_words)
    wr.writerow(dist_words)

<h3> Corpus with sequential order </h3>
<br>
<i>Note:</i> <p>Contains stop words</p>

In [8]:
with open('dialogue.csv', 'w', newline='') as dialogue:
    writer = csv.writer(dialogue)
    writer.writerow(corpus)
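
<p> Since the saved corpus still contains stop words, a later step could filter them out. The sketch below assumes a tiny, hand-written stop-word set; <em>STOP_WORDS</em> and <em>remove_stop_words</em> are hypothetical names, and a real list would be much longer. </p>

```python
# Hypothetical, deliberately tiny stop-word set (an assumption for illustration)
STOP_WORDS = {"i", "a", "the", "and", "to", "in", "it", "was", "that", "is"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are non-empty and not in the stop-word set."""
    return [token for token in tokens if token and token not in stop_words]


# Made-up tokens, including an empty string like the ones
# produced by splitting on single spaces
sample = ["i", "was", "really", "hungry", "and", "cold", "", "i", "think"]
filtered = remove_stop_words(sample)
print(filtered)  # ['really', 'hungry', 'cold', 'think']
```

<p> Note that removing stop words changes the sequence, so this would suit frequency analysis better than word prediction. </p>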