## Topic Modelling Newspapers

We are going to use this notebook to topic model our newspaper corpus. We start by setting up our imports.

In [1]:
import gensim



In [2]:
import spacy
import os

In [3]:
import xml.etree.ElementTree as ET

### Accessing XML files

The code below iterates through every XML file and collects the information we need from it. An article is added to our corpus only if it is censorship related, which the following function checks.

In [4]:
# corpus = {}

In [5]:
def is_censorship(text):
    # Decides whether a text is censorship related. For each additional word we wish
    # to add to constrain our group, we add 'or "word" in text' to the return line below.
    # Note: "censor" already matches "censorship", since it is a substring of it.
    return "censor" in text or "censorship" in text

In [6]:
def is_year(text):
    years = ["1918"]
    for year in years:
        if year in text:
            return True
    return False

In [7]:
is_censorship("censor censorship suppress ban") # Here we can display individual texts in order to see if they fit within our 'is_censorship' group as defined above

True
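
As the comment in `is_censorship` notes, widening the filter means chaining more `or` clauses. A minimal sketch of an alternative, assuming a keyword list is easier to maintain than a growing condition. The name `matches_keywords` and the extra keyword `"ban"` are illustrative and not part of the original notebook.

```python
def matches_keywords(text, keywords=("censor",)):
    """Return True if any keyword occurs in text (case-insensitive substring match)."""
    if not text:  # ElementTree returns None for elements without text
        return False
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

print(matches_keywords("SUPPRESSION, NOT CENSORSHIP"))
print(matches_keywords("Rushing planes for Pershing"))
print(matches_keywords("papers banned", keywords=("censor", "ban")))
```

Lowercasing first means headlines set in capitals are matched as well, which the plain `in` test would miss.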

In [8]:
texts = [] # We create an empty list called "texts"

In [9]:
# i = 0 # Here we include the entire NYT database in our corpus
# files = {}
# for folder in os.listdir("NYT"):
#     for filename in os.listdir("NYT/" + folder):
#         if filename.endswith(".xml"):
#             tree = ET.parse("NYT/" + folder + "/" +filename)
#             root = tree.getroot()
#             try:
#                 if is_censorship(root[-1].text):
#                     files[filename] = []
#                     files[filename].append(root[-1].text)
#                     files[filename].append(root[3].text)
#                     files[filename].append(root[4].text)
#                     # add it to corpus
#             except IndexError:
#                 continue

In [10]:
i = 0 # Here we selectively add folders to our corpus
files = {}
folders = ["NYT/sm_55428_1097/", "NYT/sm_55428_1098/", "NYT/sm_55428_1099/", "NYT/sm_55428_1100/",  "NYT/sm_55428_1101/", "NYT/sm_55428_1102/", "NYT/sm_55428_1103/", "NYT/sm_55428_1104/", "NYT/sm_55428_1105/", "NYT/sm_55428_1106/", "NYT/sm_55428_1109/", "NYT/sm_55428_1110/", "NYT/sm_55428_1111/", "NYT/sm_55428_1112/", "NYT/sm_55428_1113/", "NYT/sm_55428_1114/", "NYT/sm_55428_1116/", "NYT/sm_55428_1117/", "NYT/sm_55428_1118/", "NYT/sm_55428_1119/", "NYT/sm_55428_1120/",]
for folder in folders:
    for filename in os.listdir(folder):
        if filename.endswith(".xml"):
            tree = ET.parse(os.path.join(folder, filename))
            root = tree.getroot()
            try:
                # '-1' signifies the last element in the .xml file, which is <FullText>
                if is_censorship(root[-1].text) and (is_year(root[3].text) or is_year(root[4].text)):
                    # add it to corpus as [full text, text of element 3, text of element 4]
                    files[filename] = [root[-1].text, root[3].text, root[4].text]
            except (IndexError, TypeError):
                # IndexError: the file has too few elements; TypeError: an element has no text (None)
                continue
#             i += 1 # for this version we only run 10000 iterations and break after th
#         if i == 10004:
#             break
#     if i == 10004:
#         break    
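
The loop above reads fields by position (`root[-1]`, `root[3]`, `root[4]`), so a record that lacks one element shifts the wrong field into place. A hedged sketch of looking fields up by tag name instead. Only `<FullText>` is named in this notebook; the `<Record>`, `<Title>` and `<Date>` tags and the sample record are assumptions made for illustration and would need to be replaced with the tags the NYT files actually use.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; every tag name except FullText is an assumption.
sample = """<Record>
  <Title>SUPPRESSION, NOT CENSORSHIP</Title>
  <Date>Jun 2, 1918</Date>
  <FullText>Criticism of the new attitude of the censor.</FullText>
</Record>"""

def read_fields(root, tags=("FullText", "Title", "Date")):
    """Return {tag: text} for each tag, using '' when a tag is missing or empty."""
    return {tag: (root.findtext(tag) or "") for tag in tags}

root = ET.fromstring(sample)
fields = read_fields(root)
print(fields["Title"], "|", fields["Date"])
```

Because missing fields come back as empty strings, the keyword and year checks can run without a `try`/`except` around them.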

In [11]:
# i = 0 # Here we selectively add folders to our corpus
# folders = ["NYT/sm_55428_1004/","NYT/sm_55428_1005/"]
# for folder in folders:
#     for filename in os.listdir(folder):
#         if filename.endswith(".txt"):
#             files[filename] = []
#             try:
#                 if is_censorship(filename):
#                     file = open(filename, "r") 
#                     files[filename].append(file.read())
#                 # add it to corpus; '-1' signifies the last element in the .xml file, which is <FullText>
#             except IndexError:
#                 continue
#     i += 1 # for this version we only run 10000 iterations and break after th
#     if i == 10004:
#         break

In [12]:
# os.listdir("NYT/sm_55428_1004/") # to view lists

In [13]:
for file in files: # Here we can see the list of articles that are related to censorship and will be topic modeled
    print(file + "\t" + files[file][1] + "\t" + files[file][2])

sm_55428_1114-19556.xml	Local Licenses Jumped From 8,780 in 1918, Gilchrist Reports.	May 29, 1922
sm_55428_1114-22497.xml	Talbot Mundy's Romance of Hira Singh--Latest Works of Fiction by Robert Stead, Caradoc Evans and, Others CAPEL SION LATEST WORKS OF FICTION THE COW PUNCHER "POILU" THE TEXAN THE THREE STRINGS THE CLOSE-UP NAMI-SAN CHILDREN OF EVE HARTLEY HOUSE WHAT IS LOVE? LATEST WORKS OF FICTION THE UNKNOWN WRESTLER THE LAWS OF CHANCE. ESMERALDA "OVER THERE" MY ERRATIC PAL	Dec 29, 1918
sm_55428_1114-2426.xml	Local Licenses Jumped From 8,780 in 1918, Gilchrist Reports.	May 29, 1922
sm_55428_1114-27612.xml	Dismay in Reichstag Over Separate Peace Overture--Kaiser's Fall Predicted. Urge Unity with Austrian Germans. Vienna Government Only on Paper. Strong Hint to Kaiser.	Oct 31, 1918
sm_55428_1114-41504.xml	Unofficial Report of Honor Given to Rainbow Chief of Staff.	Mar 16, 1918
sm_55428_1116-12013.xml	Captures Over 1,800 of the Red Guard, Whose Forces Are Retreating.	Feb 10, 1918
sm_5

sm_55428_1119-49288.xml	Baker Won't Reveal Details of Program, but Says Work Is Going Ahead. RUSHING PLANES FOR PERSHING GREAT WORK OF AIRPLANES. Major Baird Tells Parliament of the Accomplishments. AERO CLUB WIRES TO BAKER. Wants Him to Make a Public Statement on Our Air Program.	Feb 22, 1918
sm_55428_1120-13372.xml	Fearful in Austria of the Central Powers' Course Toward Them.	Jul 16, 1918
sm_55428_1120-15165.xml	Jan 10, 1918	19180110
sm_55428_1120-17747.xml	Newspapers Look to Fast-Increasing American Forces to TurnScale of War. SUPPRESSION, NOT CENSORSHIP Criticism of the New Attitude Toward Military Articles.	Jun 2, 1918
sm_55428_1120-21801.xml	Pershing, in Awarding Cross to Chaplain of Old 69th, Tells of His Inspiring Bravery. MANY OTHERS ARE CITED Sergeant Frank Gardello Brought Down Two German Planes with a Machine Gun. Idolized by the Soldiers. Other Heroes Rewarded.	Sep 8, 1918
sm_55428_1120-23586.xml	Omitted for Candidates in the Military Course. Work of Regimental Family Unit

In [14]:
len(files)

56

In [15]:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

In [16]:
years = {}

In [17]:
for file in files:
    for month in months:
        try:
            if month in files[file][1]:
                year = files[file][1].split(", ")[1]
                break
            if month in files[file][2]:
                year = files[file][2].split(", ")[1]
                break
        except IndexError:
            continue
    if year in years:
        years[year] += 1
    else:
        years[year] = 1
    print(file + "\t" + year + "\n")

sm_55428_1114-19556.xml	1922

sm_55428_1114-22497.xml	1918

sm_55428_1114-2426.xml	1922

sm_55428_1114-27612.xml	1918

sm_55428_1114-41504.xml	1918

sm_55428_1116-12013.xml	1918

sm_55428_1116-18867.xml	1918

sm_55428_1116-26431.xml	1918

sm_55428_1116-26709.xml	1918

sm_55428_1116-39613.xml	1918

sm_55428_1116-47952.xml	1918

sm_55428_1117-14048.xml	1918

sm_55428_1117-1826.xml	1918

sm_55428_1117-24317.xml	1918

sm_55428_1117-27558.xml	1918

sm_55428_1117-32342.xml	but Freed. Offered to Sail at Once. Arrested

sm_55428_1117-33507.xml	1918

sm_55428_1117-40014.xml	1918

sm_55428_1117-42092.xml	1918

sm_55428_1117-42620.xml	1918

sm_55428_1117-620.xml	1918

sm_55428_1118-10693.xml	1918

sm_55428_1118-19494.xml	1918

sm_55428_1118-20109.xml	1918

sm_55428_1118-41652.xml	1918

sm_55428_1118-42045.xml	1918

sm_55428_1119-14991.xml	1918

sm_55428_1119-15953.xml	1918

sm_55428_1119-17817.xml	1918

sm_55428_1119-19442.xml	1918

sm_55428_1119-21520.xml	1918

sm_55428_1119-31952.xml	1918

sm_5

In [18]:
# for line in open("1004-1119.txt"):
#     try:
#         name, date1, date2 = line.split("\t")
#     except ValueError:
#         print(line)
#     for month in months:
#         if month in date1:
#             try:
#                 if date1.split(", ")[1].isdigit():
#                     year = date1.split(", ")[1]
#             except IndexError:
#                 continue
#             break
#         if month in date2:
#             try:
#                 if date2.split(", ")[1].isdigit():
#                     year = date2.split(", ")[1]
#             except IndexError:
#                 continue
#             break
#     if int(year) in years:
#         years[int(year)] += 1
#     else:
#         years[int(year)] = 1
        
#     print(name + "\t" + year)

In [19]:
years

{'1922': 2,
 '1918': 52,
 'but Freed. Offered to Sail at Once. Arrested': 1,
 'Sweeping Unceasingly Up Fifth Avenue': 1}
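
The two non-year keys above appear to come from `split(", ")[1]` being applied to a headline that happens to contain a month abbreviation and a comma. A minimal sketch of a stricter extraction, assuming dates always look like `Mon D, YYYY` as in the listing. `extract_year` is a hypothetical helper, and the third sample row is invented to show the no-date case.

```python
import re
from collections import Counter

# Month abbreviation, day, comma, four-digit year, e.g. "Oct 31, 1918".
DATE_PATTERN = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2},\s+(\d{4})\b"
)

def extract_year(*fields):
    """Return the year from the first field that contains a full date, else None."""
    for field in fields:
        match = DATE_PATTERN.search(field or "")
        if match:
            return match.group(1)
    return None

rows = [
    ("Local Licenses Jumped From 8,780 in 1918, Gilchrist Reports.", "May 29, 1922"),
    ("Jan 10, 1918", "19180110"),
    ("A headline with no date, only Marching, words", ""),  # invented example
]
year_counts = Counter(extract_year(headline, date) for headline, date in rows)
print(year_counts)
```

Records without a recognisable date are counted under `None` instead of reusing the year of the previous file, which makes them easy to inspect by hand.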

In [20]:
# sorted keys
for key, value in sorted(years.items(), key=lambda x: x[0]): 
    print("{} : {}".format(key, value))

1918 : 52
1922 : 2
Sweeping Unceasingly Up Fifth Avenue : 1
but Freed. Offered to Sail at Once. Arrested : 1


In [21]:
# sorted values
for key, value in sorted(years.items(), key=lambda x: x[1]): 
    print("{} : {}".format(key, value))

but Freed. Offered to Sail at Once. Arrested : 1
Sweeping Unceasingly Up Fifth Avenue : 1
1922 : 2
1918 : 52


In [22]:
t=0
f=0

for file in files:
    if is_censorship(files[file][0]):
        
        t+=1
    else:
        f+=1
print("true = ", t, "false = ", f)

true =  56 false =  0


In [23]:
# is_censorship(files[""])

In [40]:
len(files)

56

In [25]:
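nlp = spacy.load("en") # the "en" shortcut works in spaCy 2.x; spaCy 3 requires a full model name such as "en_core_web_sm"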

In [26]:
# we filter out stop words, punctuation, numbers, short tokens and known noise
# substrings that mark a token as noise: XML entity remnants ("apos", "quot"),
# stray symbols, and OCR misreadings; note that this is a substring test,
# so e.g. "tho" also removes "author" and "thousand"
unwanted = ["&", ";", "$", " ", "apos", "quot", "nos", "tlhe", "tlie", "tihe", "thie", "andl", "tile", "tho"]
for file in files:
    article = []
    doc = nlp(files[file][0].lower())
    for w in doc:
        # keep the word only if it is not a stop word, punctuation mark or number,
        # is longer than two characters, and contains no unwanted substring
        if (not w.is_stop and not w.is_punct and not w.like_num and len(w.text) > 2
                and not any(s in w.text for s in unwanted)):
            # we add the lemmatized version of the word
            article.append(w.lemma_)
    files[file].append(article)

In [27]:
# stop word list changelog
# 12/2: added "apos" and "quot" as undesirable remnants of .xml punctuation codings
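
One side effect of the substring tests in the filter above is that a pattern such as `"tho"` or `"tile"` also discards legitimate words that merely contain it. A small sketch of the difference between substring and exact matching, using plain strings rather than spaCy tokens. The word list is made up for illustration.

```python
# OCR misreadings listed in the notebook's filter.
ocr_noise = {"tlhe", "tlie", "tihe", "thie", "andl", "tile", "tho"}

words = ["tlhe", "author", "hostile", "thousand", "tho", "censor"]

# Substring test, as in the notebook: drops any word containing a noise pattern.
kept_substring = [w for w in words if not any(p in w for p in ocr_noise)]

# Exact test: drops only words that are themselves noise.
kept_exact = [w for w in words if w not in ocr_noise]

print(kept_substring)
print(kept_exact)
```

Whether the exact test is preferable depends on the OCR quality: it keeps real words, but it also lets through noisy variants that are not listed verbatim.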

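Now that we have assembled our raw corpus, let us clean it.

## Clean Corpus

We clean it by removing stop words, punctuation, numbers, and spaces, and by lemmatizing the remaining words. The loop above already does this and stores each article's cleaned tokens as the fourth entry of its `files` record, which is what we collect below.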

In [28]:
# iterate through corpus, clean code

In [29]:
from gensim.corpora import Dictionary
import gensim
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel

In [30]:
cleaned_texts = []

In [31]:
for file in files:
    cleaned_texts.append(files[file][3])

In [32]:
bigram = gensim.models.Phrases(cleaned_texts)

In [33]:
texts = [bigram[line] for line in cleaned_texts]



In [34]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
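
`Dictionary` assigns an integer id to each distinct token, and `doc2bow` turns a document into a list of `(token_id, count)` pairs. The following pure-Python sketch mimics that representation on two toy documents to show what `corpus` holds. It illustrates the idea only and is not gensim's implementation; the ids it assigns need not match the ones gensim would choose.

```python
from collections import Counter

toy_texts = [["censor", "cable", "censor"], ["cable", "war"]]

# Assign ids in order of first appearance (gensim's own ordering may differ).
token2id = {}
for text in toy_texts:
    for token in text:
        token2id.setdefault(token, len(token2id))

def to_bow(text):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in text)
    return sorted(counts.items())

toy_corpus = [to_bow(text) for text in toy_texts]
print(token2id)
print(toy_corpus)
```

Word order is discarded in this representation, which is why the bigram step has to run before it.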

In [43]:
ldamodel = LdaModel(corpus=corpus, num_topics=7, id2word=dictionary, passes=10)

In [44]:
ldamodel.show_topics()

[(0,
  '0.006*"force" + 0.005*"general" + 0.005*"issue" + 0.004*"official" + 0.004*"ally" + 0.004*"information" + 0.004*"russia" + 0.003*"american" + 0.003*"give" + 0.003*"troop"'),
 (1,
  '0.009*"general" + 0.006*"say" + 0.005*"american" + 0.005*"air" + 0.005*"officer" + 0.004*"military" + 0.004*"war" + 0.004*"report" + 0.004*"people" + 0.004*"german"'),
 (2,
  '0.007*"say" + 0.006*"war" + 0.006*"german" + 0.005*"people" + 0.005*"man" + 0.004*"peace" + 0.003*"germany" + 0.003*"country" + 0.003*"state" + 0.003*"fight"'),
 (3,
  '0.007*"man" + 0.005*"come" + 0.005*"american" + 0.004*"war" + 0.004*"division" + 0.004*"german" + 0.004*"woman" + 0.004*"parade" + 0.003*"story" + 0.003*"float"'),
 (4,
  '0.007*"say" + 0.006*"german" + 0.005*"cable" + 0.005*"government" + 0.005*"senator" + 0.004*"line" + 0.004*"germany" + 0.004*"american" + 0.003*"committee" + 0.003*"time"'),
 (5,
  '0.009*"camp" + 0.008*"man" + 0.005*"senator" + 0.004*"government" + 0.004*"think" + 0.004*"war" + 0.004*"countr
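
Several words, such as "say", "german" and "war", recur across the topics shown. A short sketch for spotting such overlap by parsing the `weight*"word"` strings that `show_topics()` returns. The two strings are copied from topics 1 and 2 above, and `parse_topic` is a hypothetical helper.

```python
import re

topic_1 = ('0.009*"general" + 0.006*"say" + 0.005*"american" + 0.005*"air" + '
           '0.005*"officer" + 0.004*"military" + 0.004*"war" + 0.004*"report" + '
           '0.004*"people" + 0.004*"german"')
topic_2 = ('0.007*"say" + 0.006*"war" + 0.006*"german" + 0.005*"people" + '
           '0.005*"man" + 0.004*"peace" + 0.003*"germany" + 0.003*"country" + '
           '0.003*"state" + 0.003*"fight"')

# One term looks like 0.009*"general": a weight, an asterisk, a quoted word.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Return {word: weight} parsed from one show_topics() string."""
    return {word: float(weight) for weight, word in TERM.findall(topic_string)}

shared = sorted(set(parse_topic(topic_1)) & set(parse_topic(topic_2)))
print(shared)
```

A large overlap between topics can be a sign that frequent, uninformative words still dominate, or that the number of topics is too high for a corpus of this size.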

In [37]:
# list of stop words, following restriction of punctuation, spaces;
# apos quot nos tlhe tlie tihe thie andl tile tho
