<h2> import packages and dataset </h2>

In [1]:
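# Standard library
import re
import string

# Third-party
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import spacy

# One-time setup if the NLTK data is missing:
# nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')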

In [2]:
df = pd.read_csv('C:/Users/david/Documents/0_kul/1_ma_stds_1/0_mda/project/data/merged.csv', delimiter=',')

In [3]:
print(len(df.Transcript.values[1]))
print(df.Transcript.values[1])

29286
177.	 : It is a fortunate coincidence that precisely at a time when the United Nations is celebrating its first twenty-five years of existence, an eminent jurist so closely linked to our Organization should have been elected to preside over the General Assembly. On behalf of the Argentine Government, it is a pleasure for me to congratulate Your Excellency, Mr. President, on this felicitous choice.
178.	Through you I should also like to express the appreciation of the Argentine delegation to Mrs. Angie BrooksRandolph of Liberia, for the work she performed as President of the twenty-fourth session.
179.	From this rostrum, the Argentine Government wishes to express to the delegation of the United Arab Republic its regret and sympathy upon the recent death of His Excellency President Gamal Abdel Nasser. The loss of this outstanding statesman has not only plunged the Arab world into mourning, but has also deeply grieved all those who greatly valued his capacity to contribute actively 

<h2> cleaning raw speeches </h2>

In [4]:
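def clean_text(text):
    """Lowercase a speech and strip line breaks, digits and punctuation."""
    text = text.lower()
    text = re.sub(r'\n', '', text)  # drop newlines
    text = re.sub(r'\t', '', text)  # drop tabs
    # Removes every digit. Ideally only the leading paragraph numbers ("177.") would go,
    # since numbers such as 9/11 can carry meaning.
    text = re.sub(r'[0-9]', '', text)
    # Matches only bracketed spans that begin with ",!"; use r'\[.*?\]' to drop any bracketed span.
    text = re.sub(r'\[,!.*?\]', '', text)
    # Replacing punctuation with spaces also removes sentence boundaries,
    # so sentence-level tokenization is no longer possible afterwards.
    text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)
    return text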
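
The comment on the digit-removal step notes that only the paragraph numbers should ideally be removed. A minimal sketch of that idea follows, assuming every paragraph number sits at the start of a line and is followed by a period, as in the raw output above. The helper name `strip_paragraph_numbers` and the second sample sentence are illustrative and not part of the original notebook.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Matches a paragraph number such as "177." at the start of a line,
# optionally followed by whitespace and a stray colon ("177.\t : It is ...").
# The (?!\d) lookahead keeps decimals such as "3.5" intact.
PARAGRAPH_NUMBER = re.compile(r'^\s*\d+\.(?!\d)\s*(?::\s*)?', flags=re.MULTILINE)

def strip_paragraph_numbers(text):
    """Remove leading paragraph numbers but keep every other digit."""
    text = PARAGRAPH_NUMBER.sub('', text)
    # Collapse newlines and tabs into single spaces so words are not glued together.
    return re.sub(r'\s+', ' ', text).strip()

# Illustrative sample: the second sentence is made up to show that "9/11" survives.
sample = "177.\t : It is a fortunate coincidence.\n178.\tSince 9/11 the world has changed."
print(strip_paragraph_numbers(sample))
```

If this were used, it would have to run before newlines are removed in `clean_text`, because the pattern relies on line starts to locate the paragraph numbers.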

In [5]:
df['Transcript'] = df['Transcript'].apply(lambda x: clean_text(str(x)))

In [7]:
print(len(df.Transcript.values[1]))
print(df.Transcript.values[1])

28965
    it is a fortunate coincidence that precisely at a time when the united nations is celebrating its first twenty five years of existence  an eminent jurist so closely linked to our organization should have been elected to preside over the general assembly  on behalf of the argentine government  it is a pleasure for me to congratulate your excellency  mr  president  on this felicitous choice  through you i should also like to express the appreciation of the argentine delegation to mrs  angie brooksrandolph of liberia  for the work she performed as president of the twenty fourth session  from this rostrum  the argentine government wishes to express to the delegation of the united arab republic its regret and sympathy upon the recent death of his excellency president gamal abdel nasser  the loss of this outstanding statesman has not only plunged the arab world into mourning  but has also deeply grieved all those who greatly valued his capacity to contribute actively to the establi

<h2> removing stopwords & lemmatization </h2>

In [8]:
def remove_stopwords(text):
    """Drop stopwords, then lemmatize the remaining tokens with WordNet."""
    stopwords_corpus = nltk.corpus.stopwords.words('english')
    # Domain-specific terms that appear in nearly every speech
    stopwords_additional = ['united','nations','nation', 'international','society','organization','organizations',
                            'relations','relation','global','charter','general','assembly','/n','/t','/n/n']
    # A set gives constant-time membership tests
    stop_words = set(stopwords_corpus + stopwords_additional)
    wnl = WordNetLemmatizer()  # a lemmatizer, not a stemmer
    word_tokens = word_tokenize(text)
    # Stopwords are filtered on the surface form, before lemmatization
    filtered = [wnl.lemmatize(w) for w in word_tokens if w not in stop_words]
    return ' '.join(filtered)

In [9]:
df['Transcript'] = df['Transcript'].apply(lambda x: remove_stopwords(str(x)))

In [10]:
print(len(df.Transcript.values[1]))
print(df.Transcript.values[1])

17545
fortunate coincidence precisely time celebrating first twenty five year existence eminent jurist closely linked elected preside behalf argentine government pleasure congratulate excellency mr president felicitous choice also like express appreciation argentine delegation mr angie brooksrandolph liberia work performed president twenty fourth session rostrum argentine government wish express delegation arab republic regret sympathy upon recent death excellency president gamal abdel nasser loss outstanding statesman plunged arab world mourning also deeply grieved greatly valued capacity contribute actively establishment peace middle east created founder two fundamental purpose mind one hand solemnly formulate basic principle system establish legal framework keeping political social need immediate postwar era objective result historical experience mankind interest aspiration civilized way renounce enunciation purpose principle thus considered sign moral evolution maturity people need

<h2> Lemmatization with spaCy (Khachatur) </h2>

In [11]:
nlp = spacy.load('en_core_web_sm')
def lemmatizer(text):
    # Run the spaCy pipeline and join the lemma of every token
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

In [12]:
df["sentence_lemmatize"] = df['Transcript'].apply(lemmatizer)

In [13]:
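# '-PRON-' is the pronoun placeholder lemma emitted by spaCy 2.x; regex=False makes the match literal
df['sentence_lemmatize_clean'] = df['sentence_lemmatize'].str.replace('-PRON-', '', regex=False)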
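
Deleting the `-PRON-` placeholder leaves doubled spaces wherever a pronoun stood between two words. The sketch below shows one way to remove the token and tidy the whitespace in a single step. The helper name `drop_placeholder` is hypothetical, and the sample string is made up for illustration.

```python
import re

PLACEHOLDER = '-PRON-'  # pronoun lemma emitted by spaCy 2.x

def drop_placeholder(text, placeholder=PLACEHOLDER):
    """Remove the placeholder token and collapse the whitespace it leaves behind."""
    without = text.replace(placeholder, ' ')
    return re.sub(r'\s+', ' ', without).strip()

# Made-up sample string
print(drop_placeholder('-PRON- wish express -PRON- regret'))
```

Applied to the notebook's data, this could look like `df['sentence_lemmatize'].apply(drop_placeholder)`. Note that it would also change the character counts printed in the following cell.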

In [14]:
print(len(df.sentence_lemmatize_clean.values[1]))
print(df.sentence_lemmatize_clean.values[1])

17084
fortunate coincidence precisely time celebrate first twenty five year existence eminent jurist closely link elect preside behalf argentine government pleasure congratulate excellency mr president felicitous choice also like express appreciation argentine delegation mr angie brooksrandolph liberia work perform president twenty fourth session rostrum argentine government wish express delegation arab republic regret sympathy upon recent death excellency president gamal abdel nasser loss outstanding statesman plunge arab world mourning also deeply grieve greatly value capacity contribute actively establishment peace middle east create founder two fundamental purpose mind one hand solemnly formulate basic principle system establish legal framework keep political social need immediate postwar era objective result historical experience mankind interest aspiration civilized way renounce enunciation purpose principle thus consider sign moral evolution maturity people need rely upon stable

<h2> Saving the file </h2>

In [16]:
print(df.size)

56658


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8094 entries, 0 to 8093
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Unnamed: 0                8094 non-null   int64 
 1   Year                      8094 non-null   object
 2   Session                   8094 non-null   object
 3   Country                   8094 non-null   object
 4   Transcript                8094 non-null   object
 5   sentence_lemmatize        8094 non-null   object
 6   sentence_lemmatize_clean  8094 non-null   object
dtypes: int64(1), object(6)
memory usage: 442.8+ KB


In [None]:
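# index=False avoids writing yet another unnamed index column (see 'Unnamed: 0' above)
df.to_csv('C:/Users/david/Documents/0_kul/1_ma_stds_1/0_mda/project/data/transcript_preprocessed.csv', encoding='utf-8', index=False)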