In [None]:
import pandas as pd
import re

>* We are going to use email data from the program-l mailing list to build the nodes and edges of a graph.
>* https://www.freelists.org/archive/program-l
> * Let's read the file using `pd.read_feather`.

>* Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames.
>* Reading a Feather file may require installing an additional library first (pandas relies on `pyarrow` for this).

In [None]:
data=pd.read_feather('sample-feather.feather')

> * Let's check the columns

In [None]:
data.columns

>* Let's check datatypes

In [None]:
data.dtypes

>* Let's check the first few rows of the dataframe

In [None]:
data.head()

>* The `data` dataframe has six columns:
    
    * `thread_id` : unique id for each thread
    * `thread_name` : the subject line of the first email in the thread
    * `body` : the content of the email 
    * `account` : the email account of the sender 
    * `url` : the url of the email
    * `date` : the date of the email 

> * Think of thread as an email conversation. `thread_id` is the unique id for the email conversation.

>* Let's check which thread has the largest number of accounts involved in the conversation.

In [None]:
data['account'].apply(lambda x: len(x)).sort_values(ascending=False)

> * The thread at index 39 has 43 entries in its `account` list, one per email sent.
> * Let's see which users are involved in that conversation.

In [None]:
data.loc[39, 'account']

>* We can see that some users appear more than once, meaning they sent several emails in the conversation.

>* We want to count the unique users involved in each conversation.
>* To do so, we can use the `nunique()` method, which returns the number of unique elements in a `pd.Series` object.
> * So, we first have to convert each list into a `pd.Series` object.

In [None]:
data['account'].apply(lambda x: pd.Series(x).nunique()).sort_values(ascending=False)

> * The thread at index 39 still has the most unique users involved in the conversation.
> * But the rest of the ranking changes: the thread with the third-most unique users is not the thread with the third-most emails.
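
>* The difference between counting emails and counting unique users can be sketched on toy data without pandas. The names and threads below are hypothetical, and `len(set(...))` stands in for `pd.Series(...).nunique()`.

```python
# Hypothetical toy data: each inner list holds the sender of every email in one thread.
threads = [
    ["ann", "bob", "ann", "ann"],   # 4 emails, 2 distinct senders
    ["cat", "dan", "eve"],          # 3 emails, 3 distinct senders
]

# len() counts every email, so repeated senders are counted more than once.
email_counts = [len(senders) for senders in threads]

# set() keeps one copy of each sender, so this counts distinct participants.
unique_counts = [len(set(senders)) for senders in threads]

print(email_counts)
print(unique_counts)
```

>* Here the first thread ranks first by emails but second by unique users, which is the same kind of reordering seen above.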

> * Let's do text mining on the `body` column to find the most common words used in the conversation.
> * To do so, let's import necessary libraries we practiced in the previous classes.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

>* We learned the importance of pre-processing before doing text mining.
> * Lowercasing, removing punctuation, removing stop words, and tokenization are the most common pre-processing steps.

>* The body column is a list. We have to join the strings in the list to make it a single string.

In [None]:
data['body'].apply(lambda x: len(x))

In [None]:
data['body-str'] = data['body'].apply(lambda x: ' '.join(x))

>* Let's lowercase the body column first.

In [None]:
data['body-lower']=data['body-str'].apply(lambda x: x.lower())

In [None]:
print(data['body-str'].iloc[0]) #before lowercasing

In [None]:
print(data['body-lower'].iloc[0])

>* Okay! Lowercasing is done. Now, let's remove the stop words.

In [None]:
stop=stopwords.words('english')
#loading stopwords in the variable named stop

In [None]:
data['stopword']=data['body-lower'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
#The lambda function takes each row of the 'body-lower' column, splits it into a list of words, 
#and then joins the words back together into a string, excluding any words that are in the 'stop' list.

In [None]:
data['body-lower'].iloc[98]

In [None]:
data['stopword'].iloc[98]

> * This time, let's do tokenization.

In [None]:
data['token']=data['stopword'].apply(lambda x: word_tokenize(x))

In [None]:
data['token'].iloc[98][:10]

>* Finally let's get rid of the punctuation.

In [None]:
data['punct_token']=data['token'].apply(lambda x: [word for word in x if word.isalnum()])
#if the string is alphanumeric, it is included in the list

In [None]:
data['punct_token'].iloc[98][:10]
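
>* The four pre-processing steps can also be collected into one helper. This is only a sketch under assumptions: `preprocess` and `TOY_STOPWORDS` are hypothetical names, the stop word list is a tiny stand-in for NLTK's English list, a regular expression replaces `word_tokenize`, and stop words are removed after tokenization rather than before it.

```python
import re

# Hypothetical, deliberately tiny stop word list; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {"the", "is", "a", "to", "and", "of", "in", "it"}

def preprocess(text, stopwords=TOY_STOPWORDS):
    """Lowercase, tokenize, drop stop words, and drop punctuation-only tokens."""
    lowered = text.lower()
    # \w+ grabs runs of letters, digits, and underscores; [^\w\s] grabs single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", lowered)
    no_stop = [tok for tok in tokens if tok not in stopwords]
    # Keep only purely alphanumeric tokens, as in the notebook's isalnum() filter.
    return [tok for tok in no_stop if tok.isalnum()]

sample = "The compiler is fast, and the debugger is easy to use!"
print(preprocess(sample))
```

>* Keeping the steps in one function makes it easier to apply the same cleaning to another column later.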

>* We are interested in finding the most common words used in the thread (email conversation) index 39.

In [None]:
from collections import Counter

In [None]:
Counter(data['punct_token'].iloc[39]).most_common(10)
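
>* As a small illustration of how `Counter` behaves, the sketch below counts words in hypothetical token lists that stand in for rows of the `punct_token` column, first for one thread and then pooled over all threads.

```python
from collections import Counter

# Hypothetical token lists, one per thread, standing in for the `punct_token` column.
thread_tokens = [
    ["python", "screen", "reader", "python"],
    ["reader", "python", "editor"],
]

# Word counts for a single thread.
first_thread = Counter(thread_tokens[0])
print(first_thread.most_common(1))

# update() adds the counts of each thread into one running Counter.
pooled = Counter()
for tokens in thread_tokens:
    pooled.update(tokens)
print(pooled.most_common(2))
```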

> * Let's see which thread has been alive for the longest time.
> * Some threads last a few days, some a few months, and some a few years.
> * We can calculate the time difference between the first and the last email of each thread.

In [None]:
data['date'].apply(lambda x: x.max())
#max() function will return the latest date 

In [None]:
data['date'].apply(lambda x: x.min())
#min() function will return the earliest date

In [None]:
data['date'].apply(lambda x: x.max()-x.min()).sort_values(ascending=False) 
#subtracting min() from max() returns the difference between the latest and the earliest date

> * Okay, the thread at index 6 has been active for more than 1000 days!

>* Let's see what users have been talking about in the thread with index 6.

In [None]:
data.loc[6, 'body']

In [None]:
data.loc[6, 'date']
#The earliest date of this conversation is 2015-12-22 and the latest date is 2020-09-25
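
>* The lifespan calculation can be checked by hand with the standard library. In this sketch the first and last dates are the two quoted in the comment above, the middle date is a made-up filler, and plain `datetime.date` objects stand in for whatever timestamp type the `date` column actually holds.

```python
from datetime import date

# First and last dates come from the thread at index 6; the middle one is a made-up filler.
thread_dates = [date(2015, 12, 22), date(2018, 3, 1), date(2020, 9, 25)]

# Subtracting two dates yields a timedelta whose .days attribute is the gap in days.
lifespan = max(thread_dates) - min(thread_dates)
print(lifespan.days)
```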

>* Let's see the most common words used in the thread that has been alive for the longest time.

In [None]:
Counter(data['punct_token'].iloc[6]).most_common(10)

>* Let's jump into the network part of this data.
>* Always remember there are three main components of a network: nodes, edges, and attributes.

In [None]:
import networkx as nx

>* How do you want to design the graph with the given data?

In [None]:
G=nx.path_graph(5)
nx.draw(G)

In [None]:
C=nx.complete_graph(5)
nx.draw(C)

> * If we think about directionality, the path graph will look like below

In [None]:
G=nx.path_graph(5, create_using=nx.DiGraph())
nx.draw(G)

>* But given the back-and-forth nature of email conversations, edges are likely to run in both directions, so the graph can be treated as undirected.

In [None]:
C_directed=nx.complete_graph(5, create_using=nx.DiGraph())
nx.draw(C_directed)

> * Let's think about nodes
> * Where can we get the nodes from? They are in the `account` column, but each row stores them as a list.

In [None]:
data['account']

>* How many unique nodes are there in the data?

In [None]:
pd.Series([item for sublist in data['account'] for item in sublist]).nunique()

> * Let's build edges between users in the conversation (thread).
> * To do so, we will use the `account` column and iterate over the rows to create edges between users.

In [None]:
#We will need a combination of all the accounts in the 'account' column to create the edges of the graph
#We will use itertools.combinations to create the combination

import itertools
edges=[]
for idx, val in data['account'].items():
    edges.extend(list(itertools.combinations(val, 2)))

In [None]:
edges[:10]
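
>* To see what `itertools.combinations` produces for a single thread, here is a sketch with a hypothetical sender list. Because combinations are formed by position, a sender who appears twice yields a pair with themselves and pairs in both orders.

```python
from itertools import combinations

# Hypothetical thread: "ann" wrote twice, "bob" once.
senders = ["ann", "bob", "ann"]

# Every pair of positions (0,1), (0,2), (1,2) becomes one candidate edge.
pairs = list(combinations(senders, 2))
print(pairs)
```

>* The output contains `('ann', 'ann')`, a self-loop, and both `('ann', 'bob')` and `('bob', 'ann')`, which describe the same undirected edge.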

>* Let's get rid of the self-loops.

In [None]:
edges_loop = [edge for edge in edges if edge[0] != edge[1]]

In [None]:
edges_loop[:10]

>* We can also get rid of the duplicate edges if we want to design the graph as an unweighted graph.

In [None]:
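# Sort each pair so that (a, b) and (b, a) count as the same undirected edge, then de-duplicate
edges_loop = list({tuple(sorted(edge)) for edge in edges_loop})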

In [None]:
edges_loop[:10]
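
>* If we wanted a weighted graph instead, the duplicates could be counted rather than discarded. A minimal sketch, assuming a hypothetical edge list and using the number of repeats as the edge weight:

```python
from collections import Counter

# Hypothetical edge list with self-loops already removed.
toy_edges = [("ann", "bob"), ("bob", "ann"), ("ann", "cat"), ("ann", "bob")]

# Sorting each pair makes ("bob", "ann") and ("ann", "bob") the same key.
weights = Counter(tuple(sorted(edge)) for edge in toy_edges)
print(weights)
```

>* Each key is one undirected edge and each value is how many times that pair of users co-occurred.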

>* Let's see who has the highest degree centrality in the graph.

In [None]:
degree={}
for element in pd.Series([item for sublist in data['account'] for item in sublist]).unique():
    count=0
    for edge in edges_loop:
        if element in edge:
            count+=1
    degree[element]=count    

> * To sort degree based on the value of the degree, we can use `sorted` function.

In [None]:
sorted_x = sorted(degree.items(), key=lambda k: k[1], reverse=True)
sorted_dict = dict(sorted_x)
dict(list(sorted_dict.items())[:10])
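
>* The same degree count can be obtained in a single pass over the edges, which avoids scanning the whole edge list once per user. This is a sketch on a hypothetical edge list, not a replacement for the loop above.

```python
from collections import Counter

# Hypothetical de-duplicated, undirected edge list.
toy_edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("ann", "dan")]

# Each edge adds one to the degree of both of its endpoints.
degree = Counter()
for u, v in toy_edges:
    degree[u] += 1
    degree[v] += 1

# most_common() returns the users sorted by degree, highest first.
print(degree.most_common(2))
```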

In [None]:
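# Number of unique accounts (nodes); a node can have at most n_nodes - 1 neighbours
n_nodes = len(pd.Series([item for sublist in data['account'] for item in sublist]).unique())
degree_centrality={}
for key, value in sorted_dict.items():
    degree_centrality[key]=value/(n_nodes-1)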

>* `jacobk` has the highest degree centrality in the graph.
>* Let's check `jacobk`'s degree centrality.

In [None]:
degree_centrality['jacobk']
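
>* A hand-computed value like this can be compared with `networkx`. The sketch below uses a hypothetical toy edge list and checks that `nx.degree_centrality` matches degree divided by (number of nodes - 1).

```python
import networkx as nx

# Hypothetical undirected edge list.
toy_edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("ann", "dan")]

G = nx.Graph()
G.add_edges_from(toy_edges)

# nx.degree_centrality normalizes each node's degree by (number of nodes - 1).
centrality = nx.degree_centrality(G)

# The same quantity computed by hand for comparison.
manual = {node: G.degree(node) / (G.number_of_nodes() - 1) for node in G.nodes}

print(centrality["ann"], manual["ann"])
```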

>* Practice

>* This time, we want to subset the data to only include the conversation that has involved `jacobk`.
>* Hint! The `isin()` method can be useful. Remember that `isin()` is a method of the `pd.Series` object.
>* Also, try `.apply()` and `lambda` function.

>* Let's put the result in `jacobkdf` variable.

In [None]:
#YOUR CODE HERE

>* We are curious about the most common words that `jacobk` has used in the conversation.
>* We have to use the `body` column because in `punct_token` all the strings of the `body` column have already been joined together.
>* The strings in `body` column follow the order in the `account` column, meaning the first string in `body` column has been sent by the first user in the `account` column. 
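
>* The positional link between senders and messages can be illustrated on a toy thread. The names and messages below are hypothetical, and plain lists stand in for one row of the `account` and `body` columns.

```python
# Hypothetical thread: the i-th message was sent by the i-th account.
accounts = ["ann", "bob", "ann"]
bodies = ["Hi all", "Hello Ann", "Thanks Bob"]

# enumerate() gives the positions at which a given sender appears.
positions = [i for i, sender in enumerate(accounts) if sender == "ann"]

# The same positions index into the message list.
messages = [bodies[i] for i in positions]
print(positions, messages)
```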

>* Let's find the index (order) of `jacobk` in the `account` column in the `jacobkdf` dataframe.

In [None]:
#YOUR CODE HERE

>* Let's print what `jacobk` sent in the conversation.
>* You can use the index (order) found in the previous step.

In [None]:
#YOUR CODE HERE

>* Okay, `jacobk` has sent 31 emails in 13 different threads.

>* Let's do text mining:
>* (1) Lowercasing
>* (2) Tokenization
>* (3) Removing stopwords
>* (4) Removing punctuation

>* (1) Lowercasing

In [None]:
#YOUR CODE HERE

> * (2) Tokenization

In [None]:
#YOUR CODE HERE

>* (3) Removing stopwords

In [None]:
#YOUR CODE HERE

> * (4) Removing punctuation

In [None]:
#YOUR CODE HERE

>* If we did all the pre-processing steps correctly, we can find the most common words used by `jacobk` in the conversation.
>* Q. What are the 10 most common words used by `jacobk`?

In [None]:
#YOUR CODE HERE

>* Let's build edges from the `jacobkdf` dataframe.
>* To do so, let's iterate over the rows

In [None]:
#YOUR CODE HERE

>* Let's get rid of the self-loops.

In [None]:
#YOUR CODE HERE

>* Let's get rid of the duplicate edges.

In [None]:
#YOUR CODE HERE

>* How many edges are there in the final graph?

In [None]:
#YOUR CODE HERE