# sentiment analysis of articles that mention the suspects

at this stage, we decided to start putting the pieces together. we used a few nltk modules to pick out the articles that mention anyone on our suspect list, then dug into those articles by hand to put together the full story

### import statements

lots of pandas and tools to help us iterate through the articles, in addition to nltk for tokenizing, tagging and sentiment analysis

In [1]:
import pandas as pd
import os
import chardet
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('vader_lexicon')
nltk.download('wordnet')
nltk.download('stopwords')  # needed by stopwords.words('english') in the functions below

pd.set_option('display.max_rows', 20)

### user defined functions

here we set up a few functions (and borrowed others from elsewhere within our code) to begin parsing the articles

In [None]:
### function to combine the first and last name into a new column, in case the 
### person is mentioned by their full name in the article
def combine_names(df):
    df['Name'] = df['FirstName'] + ' ' + df['LastName']
    return df

### function to crank through the dataframe and compose a new dataframe including 
### the sentiment, the article, the relevant bag of words and the person. 
### returns a new df with all listed elements
def analyze_files_for_sentiment(folder_path, individuals_df, column_name):
    sia = SentimentIntensityAnalyzer()
    stop_words = set(stopwords.words('english'))  # build once instead of once per token
    results = []

    for file in os.listdir(folder_path):
        if file.endswith('.txt'):
            with open(os.path.join(folder_path, file), 'rb') as f:
                raw_data = f.read()
                # chardet may report None for the encoding; fall back to utf-8
                encoding = chardet.detect(raw_data)['encoding'] or 'utf-8'
                text = raw_data.decode(encoding)

            tokens = word_tokenize(text)
            tokens = [t for t in tokens if t.lower() not in stop_words]
            sentences = sent_tokenize(text)  # split into sentences once per file

            for item in individuals_df[column_name]:
                sentiment = 0
                count = 0
                keywords = set()

                for token in tokens:
                    # substring match: the name only has to appear inside a token
                    if item.lower() in token.lower():
                        for sentence in sentences:
                            if token in sentence:
                                tagged_sentence = pos_tag(word_tokenize(sentence))

                                for word, tag in tagged_sentence:
                                    if tag.startswith('N') and wordnet.synsets(word):
                                        keywords.add(word)

                                sentiment += sia.polarity_scores(sentence)['compound']
                                count += 1

                if count > 0:
                    results.append((file, item, round(sentiment/count, 2), list(keywords)))

    df = pd.DataFrame(results, columns=['File Name', 'Entity', 'Sentiment Score', 'Keywords'])

    return df


### function to create a frequency assessment of different words to help us zero in on
### words of interest related to our subjects
def count_list_items(df_col):
    all_items = [item.lower() for sublist in df_col for item in sublist]

    freq_dict = {}
    for item in all_items:
        if item in freq_dict:
            freq_dict[item] += 1
        else:
            freq_dict[item] = 1

    sorted_freq = sorted(freq_dict.items(), key=lambda x: x[1], reverse=True)

    for item, freq in sorted_freq:
        print(f"{item}: {freq}")
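
The scoring step in `analyze_files_for_sentiment` boils down to: find the sentences that mention a name, score each one, and average the scores. A minimal sketch of that idea follows, using only the standard library. The regex sentence splitter and the `toy_score` function are stand-ins made up for illustration (the real code uses `sent_tokenize` and VADER's compound score), and unlike the function above this version counts each matching sentence only once.

```python
import re

def split_sentences(text):
    # crude stand-in for nltk's sent_tokenize: split after ., ! or ?
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def mean_sentiment_for(name, text, score_fn):
    """average score_fn over the sentences that mention name (case-insensitive)."""
    hits = [s for s in split_sentences(text) if name.lower() in s.lower()]
    if not hits:
        return None  # the name never appears, so there is nothing to score
    return round(sum(score_fn(s) for s in hits) / len(hits), 2)

# hypothetical scorer for the demo only: +1 per "good", -1 per "bad"
def toy_score(sentence):
    words = re.findall(r'[a-z]+', sentence.lower())
    return words.count('good') - words.count('bad')

sample = "Isia had a bad, bad day. The weather was good. Isia made a good call!"
print(mean_sentiment_for('Isia', sample, toy_score))  # -0.5, the mean of -2 and +1
print(mean_sentiment_for('Inga', sample, toy_score))  # None, no sentence mentions Inga
```

Returning `None` for a name that never appears mirrors the `if count > 0` check above, which skips names with no matching sentences.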
        

### creating the sentiment dataframe on the suspected gastech employees

we process the suspects using the above functions to create a dataframe that we can later use to parse through the relevant articles

In [None]:
suspects_df = pd.read_csv('suspects_frame.csv')

suspects_df = combine_names(suspects_df)

display(suspects_df)


Unnamed: 0.1,Unnamed: 0,LastName,FirstName,BirthDate,BirthCountry,Gender,CitizenshipCountry,CitizenshipBasis,CitizenshipStartDate,PassportCountry,PassportIssueDate,PassportExpirationDate,CurrentEmploymentType,CurrentEmploymentTitle,CurrentEmploymentStartDate,EmailAddress,MilitaryServiceBranch,MilitaryDischargeType,MilitaryDischargeDate,Name
0,0,Mies Haber,Ruscella,1964-04-26,Kronos,Female,Kronos,BirthNation,1964-04-26,,,,Administration,Assistant to Engineering Group Manager,2003-04-02,Ruscella.Mies.Haber@gastech.com.kronos,ArmedForcesOfKronos,HonorableDischarge,1984-10-01,Ruscella Mies Haber
1,17,Vann,Isia,1986-12-13,Kronos,Male,Kronos,BirthNation,1986-12-13,,,,Security,Perimeter Control,2007-12-14,Isia.Vann@gastech.com.kronos,ArmedForcesOfKronos,GeneralDischarge,2007-10-01,Isia Vann
2,19,Bodrogi,Loreto,1989-04-17,Kronos,Male,Kronos,BirthNation,1989-04-17,,,,Security,Site Control,2013-08-17,Loreto.Bodrogi@gastech.com.kronos,ArmedForcesOfKronos,HonorableDischarge,2008-10-01,Loreto Bodrogi
3,20,Cocinaro,Hideki,1980-12-25,Tethys,Male,Tethys,BirthNation,1980-12-25,Tethys,2013-05-25,2023-05-24,Security,Site Control,2010-01-01,Hideki.Cocinaro@gastech.com.kronos,TethanDefenseForceArmy,HonorableDischarge,2009-10-01,Hideki Cocinaro
4,21,Osvaldo,Hennie,1988-05-31,Kronos,Male,Kronos,BirthNation,1988-05-31,,,,Security,Perimeter Control,2011-06-07,Hennie.Osvaldo@gastech.com.kronos,ArmedForcesOfKronos,GeneralDischarge,2010-10-01,Hennie Osvaldo
5,23,Mies,Minke,1992-11-19,Kronos,Male,Kronos,BirthNation,1992-11-19,,,,Security,Perimeter Control,2013-05-22,Minke.Mies@gastech.com.kronos,ArmedForcesOfKronos,GeneralDischarge,2011-10-01,Minke Mies
6,26,Ferro,Inga,1989-06-17,Kronos,Female,Kronos,BirthNation,1989-06-17,,,,Security,Site Control,2013-01-11,Inga.Ferro@gastech.com.kronos,ArmedForcesOfKronos,GeneralDischarge,2012-10-01,Inga Ferro
7,52,Herrero,Kanon,1984-10-03,Tethys,Male,Tethys,BirthNation,1984-10-03,Tethys,2008-10-10,2018-10-09,Security,Badging Office,2008-11-20,Kanon.Herrero@gastech.com.kronos,,,,Kanon Herrero
8,53,Lagos,Varja,1976-05-01,Tethys,Female,Tethys,BirthNation,1976-05-01,Tethys,2013-07-07,2023-07-06,Security,Badging Office,2006-10-01,Varja.Lagos@gastech.com.kronos,,,,Varja Lagos


In [None]:
try:
    folder_path = 'C:/Users/Andy/PycharmProjects/pythonProject/The_Last_Stand/TP-1_Kronos/articles'
    column_name = 'FirstName'
    firstName_df = analyze_files_for_sentiment(folder_path, suspects_df, column_name)
except OSError:  # first folder not found, so try the path on the other machine
    folder_path = 'C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles'
    column_name = 'FirstName'
    firstName_df = analyze_files_for_sentiment(folder_path, suspects_df, column_name)

# display(firstName_df)
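
The two branches above differ only in which machine's folder exists, so the fallback could also be written as "pick the first folder that exists". A small sketch of that pattern, where `first_existing_dir` is a hypothetical helper and not part of our code:

```python
import os
import tempfile

def first_existing_dir(candidates):
    """return the first path in candidates that is an existing directory."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    raise FileNotFoundError(f'none of these folders exist: {candidates}')

# demo with a temporary folder standing in for the real articles folder
with tempfile.TemporaryDirectory() as real_folder:
    missing_folder = os.path.join(real_folder, 'not_here')
    chosen = first_existing_dir([missing_folder, real_folder])
    print(chosen == real_folder)  # True
```

With a helper like this, an error raised while processing the articles would surface directly instead of triggering a second attempt on the other path.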

In [None]:
try:
    folder_path = 'C:/Users/Andy/PycharmProjects/pythonProject/The_Last_Stand/TP-1_Kronos/articles'
    column_name = 'LastName'
    lastName_df = analyze_files_for_sentiment(folder_path, suspects_df, column_name)
except OSError:  # first folder not found, so try the path on the other machine
    folder_path = 'C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles'
    column_name = 'LastName'
    lastName_df = analyze_files_for_sentiment(folder_path, suspects_df, column_name)

# display(lastName_df)
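
One thing to watch with the name matching: the check `item.lower() in token.lower()` compares each name against single tokens. A last name with a space in it, like `Mies Haber`, can therefore never match, and a short name can match inside a longer word (`inga` is a substring of `singapore`, which may be why `Singapore` shows up among the keywords for `Inga`). A possible alternative, sketched here as an assumption rather than what we actually ran, is to match the name against the whole sentence using word boundaries. `mentions_name` is a hypothetical helper.

```python
import re

def mentions_name(name, sentence):
    """True if name appears in sentence as whole words, ignoring case."""
    pattern = r'\b' + re.escape(name) + r'\b'
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None

# multi-word names match because we search the sentence, not single tokens
print(mentions_name('Mies Haber', 'Ruscella Mies Haber was at the meeting.'))  # True

# "Singapore" contains "inga" but is not the name Inga
print(mentions_name('Inga', 'Teams from Turkey and Singapore took part.'))  # False
```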

In [None]:
sentiment_df = pd.concat([firstName_df, lastName_df], axis=0)

pd.set_option('display.max_rows', None)

display(sentiment_df)

sentiment_df.to_csv('sentiment_df.csv')

Unnamed: 0,File Name,Entity,Sentiment Score,Keywords
0,2.txt,Inga,0.53,"[answer, equipment, CARE, TOOK, conflict, miss..."
1,241.txt,Isia,-0.79,"[members, accident, government, results, ACCID..."
2,302.txt,Inga,0.48,"[conflict, Turkey, group, Singapore, teams, ta..."
3,377.txt,Isia,-0.8,"[error, members, accident, government, conduct..."
4,443.txt,Inga,0.66,"[care, conflict, missions, Turkey, group, Sing..."
5,48.txt,Isia,-0.86,"[error, members, accident, government, Speaks,..."
6,578.txt,Isia,-0.86,"[International, ABILA, NEAR, TRAFFIC, error, m..."
7,690.txt,Isia,-0.86,"[International, ABILA, NEAR, TRAFFIC, error, m..."
8,701.txt,Inga,-0.22,"[assistance, WORRYING, conflict, missions, Tur..."
9,756.txt,Isia,-0.86,"[International, ABILA, NEAR, TRAFFIC, error, m..."


### bag of words

we decided to include a bag of words just to see what kinds of things were being mentioned. this helps us backtrack later and pare the list down further by dropping articles that don't include the words we want to see for relevance
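
The counting can also be done with `collections.Counter`, and the same keyword lists can then be used to keep only the articles that mention a word of interest. A minimal sketch under assumed inputs: the `rows` list and the `words_of_interest` set are made-up examples, not taken from our results.

```python
from collections import Counter

# hypothetical (file name, keywords) pairs standing in for sentiment_df rows
rows = [
    ('a.txt', ['Police', 'guard', 'arrest']),
    ('b.txt', ['water', 'benzene', 'police']),
    ('c.txt', ['family', 'father']),
]

# frequency of every keyword, lowercased so "Police" and "police" count together
freq = Counter(word.lower() for _, keywords in rows for word in keywords)
print(freq.most_common(3))

# keep only the articles whose keywords overlap the words we care about
words_of_interest = {'police', 'kidnapping'}
relevant = [name for name, keywords in rows
            if words_of_interest & {w.lower() for w in keywords}]
print(relevant)
```

Lowercasing before counting matches what `count_list_items` does, so the two approaches should give the same totals on the same input.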

In [None]:
count_list_items(sentiment_df['Keywords'])

years: 40
girl: 34
members: 32
water: 31
abila: 30
group: 27
year: 25
family: 24
hours: 23
police: 21
employee: 19
purchases: 18
tractor: 18
cancer: 17
protectors: 16
benzene: 16
today: 15
father: 15
gas: 15
drinking: 14
front: 14
hardware: 14
government: 13
security: 13
guard: 13
accident: 12
signs: 12
poisoning: 12
traffic: 11
events: 11
friends: 11
time: 11
yesterday: 11
operations: 11
images: 10
meeting: 10
store: 10
life: 10
city: 10
toxins: 9
drilling: 9
pictures: 9
name: 9
news: 8
leader: 8
guards: 8
fathers: 8
man: 8
refers: 8
leukemia: 8
shares: 7
company: 7
town: 7
toxin: 7
result: 6
vehicle: 6
neighbor: 6
activist: 6
reading: 6
kidnapping: 6
may: 6
soybeans: 6
making: 6
martyr: 6
june: 6
morning: 6
julian: 6
arrest: 6
hands: 6
disease: 6
home: 6
care: 5
conflict: 5
turkey: 5
kosovo: 5
near: 5
driver: 5
error: 5
deaths: 5
posters: 5
toll: 5
founder: 5
evening: 5
death: 5
icon: 5
building: 5
capitol: 5
illnesses: 5
pollution: 5
minister: 5
indication: 5
embers: 5
campaign: 5
h