## Coding Discussion 4
#### Alia Abdelkader

#### Setup

In [1]:
import os
import pandas as pd
import numpy as np

# Change working directory to the location of the cloned repository, where the data file is located
os.chdir('/Users/Alia/Documents/Github/coding_discussions_ppol564_fall2021/04_coding_discussion/Data')

#### Import texts

In [2]:
df = pd.read_csv("stop_words.csv")
stop_words = df["word"].to_list()

# Read each article; the "with" block closes the file even if reading fails
with open("aljazeera-khashoggi.txt") as f:
    aljazeera = f.read()

with open("bbc-khashoggi.txt") as f:
    bbc = f.read()

with open("breitbart-khashoggi.txt") as f:
    breitbart = f.read()

with open("cnn-khashoggi.txt") as f:
    cnn = f.read()

with open("fox-khashoggi.txt") as f:
    fox = f.read()

#### Define functions

In [3]:
def tokenize(text=None):
    ''' This function lower-cases a text string, strips punctuation, and removes stop words
    Args:
        text (str): a text string
    Returns:
        list: the remaining words, in order, without capitalization or punctuation
    '''
    text = text.lower()
    text = (text
            .replace('.','') # Remove periods
            .replace('"','') # Remove quotation marks
            .replace('“','') # Remove special quotation marks
            .replace('(','') # Remove open paren
            .replace(')','') # Remove close paren
            .replace(',','') # Remove commas
            .replace('-','') # Remove short dash
            .replace('—','') # Remove long dash
            .replace('\'','') # Remove apostrophe
            .replace('[','') # Remove open bracket
            .replace(']','') # Remove close bracket
            .replace('?','') # Remove question mark
            .replace('!','') # Remove exclamation mark
            )
    text_list = text.split()
    text_list2 = [word for word in text_list if word not in stop_words]
    return text_list2

def convert_text_to_dtm(text):
    ''' This function converts text into a document term matrix
    Args:
        text (str): a text string
    Returns:
        df: A document term matrix (DTM) of the input text
    '''
    wcount = dict()
    for word in tokenize(text):
        if word in wcount:
            wcount[word][0] += 1
        else:
            wcount[word] = [1]
    return pd.DataFrame(wcount)

def gen_DTM(texts=None):
    ''' This function generates a document term matrix of the input texts
    Args:
        texts (list): a list containing text string objects
    Returns:
        df: A document term matrix (DTM) of the input texts
    '''
    # Row bind one single-row DTM per text (DataFrame.append was removed in pandas 2.0)
    DTM = pd.concat([convert_text_to_dtm(text) for text in texts],
                    ignore_index=True, sort=True)

    DTM.fillna(0, inplace=True) # Fill in any missing values with 0s (i.e. when a word is in one text but not another)
    return DTM

def cosine(a,b):
    ''' This function calculates cosine similarity between two numeric vectors
    Args:
        a (array-like) and b (array-like): term-count vectors of equal length
    Returns:
        float: the cosine similarity between the two vectors
    '''
    cos = np.dot(a,b)/(np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(b,b)))
    return cos

def cos_matrix(texts,text_names):
    ''' This function returns a matrix (data frame) of the cosine similarities between a set of strings
    Args:
        texts (list): a list containing text string objects
        text_names (list): a list of strings that correspond to the names of the objects in "texts"
    Returns:
        df: A matrix of similarities between the input texts
    '''
    df_sim = pd.DataFrame(columns = text_names)
    
    count_text1 = 0
    
    for text1 in texts:

        count_text2 = 0

        for text2 in texts:
            DTM = gen_DTM([text1,text2])
            a = DTM.iloc[0].values
            b = DTM.iloc[1].values

            cos = cosine(a,b)

            df_sim.loc[text_names[count_text1],text_names[count_text2]]=cos

            count_text2 += 1

        count_text1+=1
        
    return df_sim
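
To make the pipeline above concrete, here is a minimal, self-contained sketch on two toy sentences. It does not call the notebook's functions; it rebuilds the same idea (count words, align the counts on a shared vocabulary, take the cosine) with `collections.Counter`, and it skips stop-word removal. The sentences and the names `toy_counts` and `toy_cosine` are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter

import numpy as np


def toy_counts(text):
    """Count whitespace-separated, lower-cased words (no stop-word removal)."""
    return Counter(text.lower().split())


def toy_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a, counts_b = toy_counts(text_a), toy_counts(text_b)
    # Shared vocabulary: every word that appears in either text
    vocab = sorted(set(counts_a) | set(counts_b))
    # A Counter returns 0 for missing words, which plays the role of fillna(0)
    a = np.array([counts_a[w] for w in vocab], dtype=float)
    b = np.array([counts_b[w] for w in vocab], dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


print(toy_cosine("the journalist entered the consulate",
                 "the journalist left the consulate"))
```

The two sentences share every word except "entered" and "left", so the dot product is 6 and each vector has squared length 7, giving a similarity of 6/7 (about 0.857).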

#### Text Analysis

In [4]:
texts = [aljazeera,bbc,breitbart,cnn,fox]
text_names = ['aljazeera','bbc','breitbart','cnn','fox']

# Show matrix of cosine similarities between texts
cos_matrix(texts,text_names)

| | aljazeera | bbc | breitbart | cnn | fox |
|---|---|---|---|---|---|
| aljazeera | 1.0 | 0.678938 | 0.567446 | 0.533123 | 0.678255 |
| bbc | 0.678938 | 1.0 | 0.574092 | 0.503919 | 0.628533 |
| breitbart | 0.567446 | 0.574092 | 1.0 | 0.357928 | 0.536889 |
| cnn | 0.533123 | 0.503919 | 0.357928 | 1.0 | 0.525179 |
| fox | 0.678255 | 0.628533 | 0.536889 | 0.525179 | 1.0 |


The matrix above shows the cosine similarity between each pair of articles. Values closer to 1 indicate that two articles use more similar vocabulary. The largest similarity is between Al Jazeera and BBC (cos = 0.679), with Al Jazeera and Fox nearly identical (cos = 0.678), indicating that, based on term counts, these outlets reported on Khashoggi's murder in similar ways. The smallest similarity is between CNN and Breitbart (cos = 0.358), indicating that those two sites' coverage differed the most. Because this was a very specific event, only so many words can be used to describe what happened; to some degree, all reporting relies on the same set of objective facts such as locations, names, and dates. Even outlets that report on the event from very different angles will share much of their vocabulary.
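
The most and least similar pairs can also be read off the matrix programmatically rather than by eye. The sketch below assumes the values printed above are re-entered by hand into a data frame called `sim` (a hypothetical name), and it lists each unordered pair once using `itertools.combinations`.

```python
from itertools import combinations

import pandas as pd

names = ['aljazeera', 'bbc', 'breitbart', 'cnn', 'fox']

# Values copied from the similarity matrix printed above
sim = pd.DataFrame(
    [[1.0, 0.678938, 0.567446, 0.533123, 0.678255],
     [0.678938, 1.0, 0.574092, 0.503919, 0.628533],
     [0.567446, 0.574092, 1.0, 0.357928, 0.536889],
     [0.533123, 0.503919, 0.357928, 1.0, 0.525179],
     [0.678255, 0.628533, 0.536889, 0.525179, 1.0]],
    index=names, columns=names)

# One entry per unordered pair, skipping the diagonal; sorted high to low
pairs = pd.Series(
    {(a, b): sim.loc[a, b] for a, b in combinations(names, 2)}
).sort_values(ascending=False)

print(pairs)
print("most similar: ", pairs.idxmax())
print("least similar:", pairs.idxmin())
```

Sorting the ten pairs this way also makes near-ties visible, such as Al Jazeera and BBC versus Al Jazeera and Fox.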

In [5]:
# Add some additional stopwords, which are likely to appear in all articles
stop_words.append('jamal')
stop_words.append('khashoggi')
stop_words.append('khashoggis')
stop_words.append('president')
stop_words.append('recip')
stop_words.append('tayyip')
stop_words.append('erdogan')
stop_words.append('turkey')
stop_words.append('turkish')
stop_words.append('istanbul')

# Rerun the similarity matrix
cos_matrix(texts,text_names)

| | aljazeera | bbc | breitbart | cnn | fox |
|---|---|---|---|---|---|
| aljazeera | 1.0 | 0.636934 | 0.478101 | 0.371684 | 0.583179 |
| bbc | 0.636934 | 1.0 | 0.507766 | 0.377156 | 0.553408 |
| breitbart | 0.478101 | 0.507766 | 1.0 | 0.202937 | 0.469615 |
| cnn | 0.371684 | 0.377156 | 0.202937 | 1.0 | 0.333141 |
| fox | 0.583179 | 0.553408 | 0.469615 | 0.333141 | 1.0 |


After removing these additional words, the cosine similarity decreases for every pair of articles, but some pairs drop far more than others. Compared with the original matrix above (with only the standard stop words excluded), the CNN article now appears quite dissimilar to all the others; its cosine similarity with any other article does not exceed 0.4. Between CNN and Breitbart, the cosine similarity dropped from 0.358 to 0.203. This suggests that once these common words are removed, CNN reports on the event in quite a different way from the other outlets.
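
One way to quantify which outlets lost the most similarity is to subtract the second matrix from the first. The sketch below assumes both printed matrices are re-entered by hand; the names `before`, `after`, `drop`, and `mean_drop` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

names = ['aljazeera', 'bbc', 'breitbart', 'cnn', 'fox']

# Similarities with only the standard stop words removed (first matrix)
before = pd.DataFrame(
    [[1.0, 0.678938, 0.567446, 0.533123, 0.678255],
     [0.678938, 1.0, 0.574092, 0.503919, 0.628533],
     [0.567446, 0.574092, 1.0, 0.357928, 0.536889],
     [0.533123, 0.503919, 0.357928, 1.0, 0.525179],
     [0.678255, 0.628533, 0.536889, 0.525179, 1.0]],
    index=names, columns=names)

# Similarities after adding the extra stop words (second matrix)
after = pd.DataFrame(
    [[1.0, 0.636934, 0.478101, 0.371684, 0.583179],
     [0.636934, 1.0, 0.507766, 0.377156, 0.553408],
     [0.478101, 0.507766, 1.0, 0.202937, 0.469615],
     [0.371684, 0.377156, 0.202937, 1.0, 0.333141],
     [0.583179, 0.553408, 0.469615, 0.333141, 1.0]],
    index=names, columns=names)

# Element-wise decrease; the diagonal is 0 because self-similarity stays at 1
drop = before - after

# Average decrease of each outlet against the four other outlets
mean_drop = drop.sum(axis=1) / (len(names) - 1)

print(drop.round(3))
print(mean_drop.sort_values(ascending=False).round(3))
```

With these values, CNN shows the largest average decrease (roughly 0.16), which is consistent with the reading that its apparent similarity to the other outlets rested more heavily on the shared names and places.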