## Topic Modelling Newspapers

We are going to use this notebook to topic model our newspaper corpus. We start by setting up our imports.

In [1]:
import gensim



In [2]:
import spacy
import os

In [3]:
import xml.etree.ElementTree as ET

### Accessing XML files

In the following cells we gather the important information from the XML files. The code iterates through every XML file in the selected folders and parses it, but an article is added to our corpus only if its full text is censorship related and one of its two header fields (headline or date) contains "1914". The two helper functions below perform these checks.

In [4]:
# corpus = {}

In [5]:
def is_censorship(text):
    # Decides whether a text is censorship related. "censor" is a substring of "censorship",
    # "censored" and "censors", so one test covers them all; for each additional word we wish
    # to add to widen our group, we add 'or "word" in text' to the return line below
    return "censor" in text

In [6]:
def is_year(text):
    years = ["1914"]
    for year in years:
        if year in text:
            return True
    return False

In [7]:
is_censorship("censor censorship suppress ban") # Here we can display individual texts in order to see if they fit within our 'is_censorship' group as defined above

True
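
As the list of search terms grows, chaining `or` clauses becomes unwieldy. Below is a minimal sketch of an alternative that keeps the terms in a tuple. The names `CENSORSHIP_TERMS` and `matches_any` are illustrative and not part of the original notebook, the extra term is borrowed from the test string above, and the case-insensitive comparison is an assumption (the original check is case-sensitive).

```python
# Hypothetical helper: the names and the extra term are illustrative only.
CENSORSHIP_TERMS = ("censor", "suppress")


def matches_any(text, terms=CENSORSHIP_TERMS):
    """Return True if any term occurs as a substring of text (case-insensitive)."""
    if not text:  # ElementTree gives None for an element that has no text
        return False
    lowered = text.lower()
    return any(term in lowered for term in terms)


print(matches_any("CENSOR STOPS DISPATCHES"))
print(matches_any("Hanover Wireless Station Has Longest Sending Radius."))
```

Substring matching is kept on purpose so that "censor" still covers "censorship" and "censored". Short terms need care, though: a term such as "ban" would also match unrelated words like "urban".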

In [8]:
texts = [] # We are creating a list called "texts"

In [9]:
# i = 0 # Here we include the entire NYT database in our corpus
# files = {}
# for folder in os.listdir("NYT"):
#     for filename in os.listdir("NYT/" + folder):
#         if filename.endswith(".xml"):
#             tree = ET.parse("NYT/" + folder + "/" +filename)
#             root = tree.getroot()
#             try:
#                 if is_censorship(root[-1].text):
#                     files[filename] = []
#                     files[filename].append(root[-1].text)
#                     files[filename].append(root[3].text)
#                     files[filename].append(root[4].text)
#                     # add it to corpus
#             except IndexError:
#                 continue

In [10]:
i = 0 # Here we selectively add folders to our corpus
files = {}
folders = ["NYT/sm_55428_1097/", "NYT/sm_55428_1098/", "NYT/sm_55428_1099/", "NYT/sm_55428_1100/",  "NYT/sm_55428_1101/", "NYT/sm_55428_1102/", "NYT/sm_55428_1103/", "NYT/sm_55428_1104/", "NYT/sm_55428_1105/", "NYT/sm_55428_1106/", "NYT/sm_55428_1109/", "NYT/sm_55428_1110/", "NYT/sm_55428_1111/", "NYT/sm_55428_1112/", "NYT/sm_55428_1113/", "NYT/sm_55428_1114/", "NYT/sm_55428_1116/", "NYT/sm_55428_1117/", "NYT/sm_55428_1118/", "NYT/sm_55428_1119/", "NYT/sm_55428_1120/",]
for folder in folders:
    for filename in os.listdir(folder):
        if filename.endswith(".xml"):
            tree = ET.parse(folder +filename)
            root = tree.getroot()
            try:
                if is_censorship(root[-1].text) and (is_year(root[3].text) or is_year(root[4].text)):
                    files[filename] = []
                    files[filename].append(root[-1].text)
                    files[filename].append(root[3].text)
                    files[filename].append(root[4].text)
                # add it to corpus; '-1' signifies the last element in the .xml file, which is <FullText>
            except IndexError:
                continue
#             i += 1 # for this version we only run 10000 iterations and break after th
#         if i == 10004:
#             break
#     if i == 10004:
#         break    
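
The positional lookups `root[3]`, `root[4]` and `root[-1]` rely on every file having the same element order, which may be why the second column of the listing printed further down sometimes holds a headline and sometimes a date. The sketch below looks elements up by tag name instead. Only `<FullText>` is named in this notebook, so the `Record`, `Title` and `Date` tags, the sample record, and the `read_fields` helper are all assumptions to be checked against the real files.

```python
import xml.etree.ElementTree as ET

# Made-up record: only <FullText> is named in the notebook; the other tag
# names are assumptions and must be checked against the real files.
SAMPLE = """<Record>
  <Title>Strict Censorship Now Imposed on All Press Dispatches.</Title>
  <Date>Apr 12, 1914</Date>
  <FullText>The censor has stopped all press dispatches.</FullText>
</Record>"""


def read_fields(root, tags=("FullText", "Title", "Date")):
    """Return the text of each tag, or an empty string when it is missing."""
    return {tag: (root.findtext(tag) or "") for tag in tags}


fields = read_fields(ET.fromstring(SAMPLE))
print(fields["Date"])
```

Because missing elements come back as empty strings, the substring checks can run without raising an error on incomplete records.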

In [11]:
# i = 0 # Here we selectively add folders to our corpus
# folders = ["NYT/sm_55428_1004/","NYT/sm_55428_1005/"]
# for folder in folders:
#     for filename in os.listdir(folder):
#         if filename.endswith(".txt"):
#             files[filename] = []
#             try:
#                 if is_censorship(filename):
#                     file = open(filename, "r") 
#                     files[filename].append(file.read())
#                 # add it to corpus; '-1' signifies the last element in the .xml file, which is <FullText>
#             except IndexError:
#                 continue
#     i += 1 # for this version we only run 10000 iterations and break after th
#     if i == 10004:
#         break

In [12]:
# os.listdir("NYT/sm_55428_1004/") # to view lists

In [13]:
for file in files: # Here we can see the list of articles that are related to censorship and will be topic modeled
    print(file + "\t" + files[file][1] + "\t" + files[file][2])

sm_55428_1097-10195.xml	Army Representatives Inspect Animals in Kansas City.	Sep 6, 1914
sm_55428_1097-10340.xml	Printing Soldier's Complaint Said to Have Caused Suspension.	Sep 28, 1914
sm_55428_1097-10669.xml	American Mine Manager, Consul Reports, Will Remain in Mexico for the Present. MANY FLEE FROM YUCATAN Government Will Send East Coast Refugees to New Orleans to Avoid Clash at Galveston.	May 3, 1914
sm_55428_1097-11121.xml	Dec 22, 1914	19141222
sm_55428_1097-1120.xml	Sep 12, 1914	19140912
sm_55428_1097-12252.xml	Oct 30, 1914	19141030
sm_55428_1097-12291.xml	Buildings Wrecked and Inhabitants Thrown Into a Panic ;- No Death Reported.	Sep 25, 1914
sm_55428_1097-13583.xml	Aug 30, 1914	19140830
sm_55428_1097-14662.xml	"Little Better Than Murder," He Calls His Work in Belgian Trenches. HORRORS OF 'MINENWERFER' German Trench Gun Throws a Slow-Moving Shell with Deadly Aim and Effect.	Dec 29, 1914
sm_55428_1097-14848.xml	Honolulu Station Did Not Intentionally Violate Our Neutrality.	Oct 1

sm_55428_1097-39800.xml	Cabinet Is Urged by Diplomats of the Allies to Preserve Neutrality. CRUISERS A SORE POINT Bryan Denies Prediction by Our Ambassador of Massacre ;- Destitution Reported In Palestine.	Aug 27, 1914
sm_55428_1097-40810.xml	Parliament Unanimously Agrees to Greatest Money Demand Ever Made on It. 1,000,000 MORE MEN ALSO War Costing England Nearly $5,000,000 a Day ;- National Debt Increased Over 50 Per Cent.	Nov 17, 1914
sm_55428_1097-41842.xml	Mrs. Schwimmer, Peace Advocate, Says Only U.S. Can Bring About an Armistice. MEN ARE READY TO STOP Thousand of Women Suicides in Warring Countries, She Tells Reform Rabbis.	Dec 8, 1914
sm_55428_1097-43334.xml	Nov 25, 1914	19141125
sm_55428_1097-4469.xml	Jul 26, 1914	19140726
sm_55428_1097-4530.xml	Jul 30, 1914	19140730
sm_55428_1097-45380.xml	Mar 29, 1914	19140329
sm_55428_1097-46158.xml	Member's Assertion that Pamphlet on the War Was Harmful Is Cheered by Hearers.	Nov 26, 1914
sm_55428_1097-46717.xml	African Republic Asks U.S. A

sm_55428_1098-25977.xml	Dec 1, 1914	19141201
sm_55428_1098-26359.xml	To Join Catholic Church, She Says, Answering a Priest's Criticism.	Jan 20, 1914
sm_55428_1098-2643.xml	Aug 4, 1914	19140804
sm_55428_1098-26758.xml	Nov 23, 1914	19141123
sm_55428_1098-26812.xml	Civil Rights Suspended -- Bank Rate Up and Boerse Closed.	Jul 27, 1914
sm_55428_1098-27116.xml	Says Their Testimony Relates to Facts of Which Jurors Are Judges. POINTS RAISED ARE NOVEL Norman Hapgood, Prof. Jenks, and Others Qualify, but Can't Testify -- Shakespeare Ruled Out.	Feb 6, 1914
sm_55428_1098-27525.xml	Tokio Censor Passes Report That Ships Are Already Shelling the German Fort. GERMANS BLOW UP BRIDGES And Raze All Tall Buildings That the Japanese Gunners Might Use for Sightings. RUMORS OF A FIGHT AT SEA Peking Hears That British, French, and Russian Warships Have Joined the Blockade.	Aug 25, 1914
sm_55428_1098-27636.xml	Paris Temps Suggests Japan Is Less Menace to Us Than Germany.	Oct 29, 1914
sm_55428_1098-27669.xml	M

sm_55428_1098-48485.xml	Copenhagen Hears Rumors That They are Being Tested in the Kiel Canal	Oct 14, 1914
sm_55428_1098-48624.xml	Sep 11, 1914	19140911
sm_55428_1098-49347.xml	THE RETURN OF THE PRODIGAL. By May Sinclair. The Macmillan Company. FICTION FOR JUNE	Jun 7, 1914
sm_55428_1098-5616.xml	French and English Lines to be Put Under Same Supervision as Wireless Stations. GERMANS MADE PROTEST Curb on Land Wire Companies Operating Into Canada, Is Their Reported Desire Now.	Aug 14, 1914
sm_55428_1098-5690.xml	Incidents in the Campaign of 1914 Described by a War Correspondent -- Paris Less Than a Year Ago -- Recent Books on European Conflict THE CAMPAIGN OF 1914 IN FRANCE AND BELGIUM. By G. H. Perris. With Maps and Plans by F. F. Perris and Photographs by the Author. New York: Henry Holt & Co. $1.50.	Aug 8, 1915
sm_55428_1098-7844.xml	But Public Can and Does Condemn the Board of Protectors Who Framed It. SKELETONS IN ALL CLOSETS And One Speaker at the Hearing to Remove Names Says There A

sm_55428_1099-33287.xml	Flower of the Indian Army Again Reported on Way for Service in France. CENSOR STOPS DISPATCHES Secret So Well Guarded That Even the Canadian Public Is Kept in Ignorance of It.	Sep 11, 1914
sm_55428_1099-33394.xml	Reference to Northern Campaign "Blacked" Out of the Newpsapers.	Sep 2, 1914
sm_55428_1099-33639.xml	Von Bernstorff Says He Hasn't Complained of Our Shipping War Supplies. WANTS EQUAL CENSORSHIP And Charges That English Cruisers Have Been Sending Requisitions to This City.	Sep 5, 1914
sm_55428_1099-33979.xml	Scenes in a Shell-Smashed City During a Lull in the Bombardment. EXCITING RIDE IN THE DARK. Grim Sentinels Mollified by Gifts of Newspapers and Cigarettes ;- A Welcome at an Inn.	Oct 28, 1914
sm_55428_1099-36253.xml	Work Sent by Cable to The Times Produced at the Coliseum.	Dec 22, 1914
sm_55428_1099-365.xml	Denies That He Ever Said She Could Not Be Defeated.	Nov 21, 1914
sm_55428_1099-36742.xml	British Censors Will Pass Those To and From This Country

sm_55428_1100-21720.xml	New York Times Correspondent Joins the American Corps at the Front. NIGHT WORK IN THE MUD Arduous Service Near the Belgian Border ;- Stiff Drills for Volunteer Chauffeurs.	Dec 20, 1914
sm_55428_1100-21753.xml	National Board by Close Vote Approves White Slave Drama Depicting District Attorney. TELLS OF SMASHING TRUST Pictures Based on Whitman's Investigation Reveal All Sides of Traffic in Women.	Feb 10, 1914
sm_55428_1100-21817.xml	Will Deport and Forfeit Bonds of All Who Go Outside of Funston's Lines. CAPT. MAIGNE RECALLED Secretary Visits Summary Punishment on Retired Officer Who Started for Capital.	May 14, 1914
sm_55428_1100-2206.xml	Its Organization and Methods Explained and Justified by Its General Manager.	Jul 6, 1914
sm_55428_1100-22278.xml	Fletcher's Joining Senate Committee Breaks Existing Tie on Measure.	Mar 21, 1914
sm_55428_1100-22304.xml	House Committee Likely to Approve the Senate Measure.	Sep 24, 1914
sm_55428_1100-22341.xml	Aug 27, 1914	19140827


sm_55428_1100-9984.xml	So Declares Rupert Hughes, Novelist and Playwright, Who Adds That the Writer in This Country Should Give a Just Appraisal of Life for What It Really Is.	May 17, 1914
sm_55428_1101-10012.xml	Oct 9, 1914	19141009
sm_55428_1101-10355.xml	And What They Write Will Be Censored ;- French Rules Issued.	Aug 12, 1914
sm_55428_1101-1152.xml	Sep 1, 1914	19140901
sm_55428_1101-11827.xml	Complications Arise at Washington as Result of Complaint by Germany. CENSOR WOULDN'T HELP HER British Representatives Could Send Cipher Messages by Way of Canada, Escaping Censorship.	Aug 15, 1914
sm_55428_1101-1338.xml	Tuckerton Station Closed by Government, Liner Transmits Messages from Sayville. KEPT IN TOUCH WITH NAUEN Steamship Hiding in Midatlantic Believed to be the Kronprinz Wilhelm. PLANT CAN'T GET PERMIT Goldschmidt Company Obeys Order of Secretary Daniels and Law Gives No Alternative.	Aug 25, 1914
sm_55428_1101-1364.xml	May 10, 1914	19140510
sm_55428_1101-14726.xml	Will Deport and F

sm_55428_1101-44402.xml	Scenes in a Shell-Smashed City During a Lull in the Bombardment. EXCITING RIDE IN THE DARK. Grim Sentinels Mollified by Gifts of Newspapers and Cigarettes ;- A Welcome at an Inn.	Oct 28, 1914
sm_55428_1101-4531.xml	African Republic Asks U.S. Advice About Neutrality of Communication. EMBASSY'S NEWS PUZZLING Radio Experts Unable to Figure Out How Messages Get from Berlin to Sayville.	Aug 26, 1914
sm_55428_1101-46684.xml	Whole Corps Passes Through Brussels ;- Aiming at Belgian Port.	Aug 22, 1914
sm_55428_1101-47269.xml	Aug 3, 1914	19140803
sm_55428_1101-47347.xml	Jul 31, 1914	19140731
sm_55428_1101-48150.xml	Fletcher Unable to Induce Federal Officials to Perform Duties.	Apr 27, 1914
sm_55428_1101-4837.xml	New York Surgeon Says Wounds Are Like Those of Dumdums.	Nov 21, 1914
sm_55428_1101-48833.xml	Oct 10, 1914	19141010
sm_55428_1101-48966.xml	Said He Wouldn't Help War Fund ;- Liked United States Better.	Oct 1, 1914
sm_55428_1101-48971.xml	Paris Hears Tale of Engine 

sm_55428_1103-16080.xml	Hanover Wireless Station Has Longest Sending Radius.	Aug 18, 1914
sm_55428_1103-18661.xml	How They Are Fighting the Germans Pressing on Antwerp.	Oct 2, 1914
sm_55428_1103-19567.xml	J. L. Garvin Sums Up the Prevailing Feeling and Predicts Political Upheaval in Germany.	Sep 30, 1914
sm_55428_1103-24761.xml	Begins Bombardment of Interior Forts, Says Athens ;- May Plan to Force Passage.	Dec 21, 1914
sm_55428_1103-24800.xml	Wrecked in Unusual Circumstances.	Jan 7, 1914
sm_55428_1103-28561.xml	French Correspondent Reports Anti-Semitist Kissing Jewish Scriptures.	Oct 11, 1914
sm_55428_1103-2949.xml	Gertrude Atherton Publishes Her Friend's War Views.	Nov 30, 1914
sm_55428_1103-33390.xml	Mistook Her for a German Cruiser ;- British Ships In New Paint Since. UNPUBLISHED NAVAL NEWS Brought by a Former Officer Includes a Story of Very Young Cadets on Sunken British Vessels.	Oct 5, 1914
sm_55428_1103-37986.xml	Sep 30, 1914	19140930
sm_55428_1103-38103.xml	French and English L

sm_55428_1105-6162.xml	Strict Censorship Now Imposed on All Press Dispatches.	Apr 12, 1914
sm_55428_1105-7738.xml	Robert Blatchford Wants "Common Sense About the War" Taken Up by Parliament. ASSAILS AUTHOR BITTERLY " A Bumptious Merry-Andrew, Hungry for More Notoriety," He Says and Has "Perverted the Truth."	Nov 23, 1914
sm_55428_1105-9497.xml	World's Great Singers Also Denied to American Audiences by Hostilities. HOLMES AND HANSON TALK Lecturer and Concert Director Both Had Some Success Till After Mobilization.	Aug 29, 1914
sm_55428_1106-19091.xml	Jul 28, 1914	19140728
sm_55428_1106-20196.xml	La Prensa Calls Our Policy Unwarranted by International Law.	Apr 24, 1914
sm_55428_1106-29560.xml	None of the Metropolitan Papers Will Be Published Christmas Day.	Dec 22, 1914
sm_55428_1106-30501.xml	Wrecked in Unusual Circumstances.	Jan 7, 1914
sm_55428_1106-32379.xml	Dec 18, 1914	19141218
sm_55428_1106-33950.xml	Sep 16, 1914	19140916
sm_55428_1109-40157.xml	Proposal Opposed, as Well as Favored,

In [14]:
len(files)

599

In [15]:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

In [16]:
years = {}

In [17]:
for file in files:
    for month in months:
        try:
            if month in files[file][1]:
                year = files[file][1].split(", ")[1]
                break
            if month in files[file][2]:
                year = files[file][2].split(", ")[1]
                break
        except IndexError:
            continue
    if year in years:
        years[year] += 1
    else:
        years[year] = 1
    print(file + "\t" + year + "\n")

sm_55428_1097-10195.xml	1914

sm_55428_1097-10340.xml	1914

sm_55428_1097-10669.xml	1914

sm_55428_1097-11121.xml	1914

sm_55428_1097-1120.xml	1914

sm_55428_1097-12252.xml	1914

sm_55428_1097-12291.xml	1914

sm_55428_1097-13583.xml	1914

sm_55428_1097-14662.xml	1914

sm_55428_1097-14848.xml	1914

sm_55428_1097-14988.xml	1914

sm_55428_1097-15103.xml	1914

sm_55428_1097-1525.xml	1914

sm_55428_1097-15352.xml	1914

sm_55428_1097-15452.xml	1914

sm_55428_1097-15522.xml	1914

sm_55428_1097-15528.xml	1914

sm_55428_1097-15880.xml	1914

sm_55428_1097-16563.xml	1914

sm_55428_1097-17322.xml	1914

sm_55428_1097-18708.xml	1914

sm_55428_1097-18880.xml	1914

sm_55428_1097-19234.xml	1914

sm_55428_1097-19694.xml	1914

sm_55428_1097-19905.xml	1914

sm_55428_1097-19994.xml	1914

sm_55428_1097-20117.xml	1914

sm_55428_1097-20815.xml	1914

sm_55428_1097-2138.xml	1914

sm_55428_1097-21628.xml	1914

sm_55428_1097-2168.xml	1914

sm_55428_1097-22232.xml	1914

sm_55428_1097-23270.xml	1914

sm_55428_1097-


sm_55428_1099-33639.xml	1914

sm_55428_1099-33979.xml	1914

sm_55428_1099-36253.xml	1914

sm_55428_1099-365.xml	1914

sm_55428_1099-36742.xml	1914

sm_55428_1099-37173.xml	1914

sm_55428_1099-37602.xml	1914

sm_55428_1099-38015.xml	1914

sm_55428_1099-38370.xml	1914

sm_55428_1099-39359.xml	1914

sm_55428_1099-3964.xml	1914

sm_55428_1099-39707.xml	1914

sm_55428_1099-40846.xml	1914

sm_55428_1099-41143.xml	1914

sm_55428_1099-41541.xml	1914

sm_55428_1099-41562.xml	1914

sm_55428_1099-4270.xml	1914

sm_55428_1099-42885.xml	1914

sm_55428_1099-43067.xml	1914

sm_55428_1099-43254.xml	1914

sm_55428_1099-4423.xml	1914

sm_55428_1099-4447.xml	1914

sm_55428_1099-45220.xml	1914

sm_55428_1099-45470.xml	1914

sm_55428_1099-46116.xml	1914

sm_55428_1099-46669.xml	1914

sm_55428_1099-46722.xml	1914

sm_55428_1099-46914.xml	1914

sm_55428_1099-4725.xml	1914

sm_55428_1099-47989.xml	1914

sm_55428_1099-48165.xml	1914

sm_55428_1099-49721.xml	1914

sm_55428_1099-49771.xml	1914

sm_55428_1099-49


sm_55428_1103-24761.xml	Says Athens ;- May Plan to Force Passage.

sm_55428_1103-24800.xml	1914

sm_55428_1103-28561.xml	1914

sm_55428_1103-2949.xml	1914

sm_55428_1103-33390.xml	1914

sm_55428_1103-37986.xml	1914

sm_55428_1103-38103.xml	1914

sm_55428_1103-39910.xml	1914

sm_55428_1103-42154.xml	1914

sm_55428_1103-44372.xml	1914

sm_55428_1103-44376.xml	1914

sm_55428_1103-45646.xml	1914

sm_55428_1103-46171.xml	1914

sm_55428_1103-46449.xml	1914

sm_55428_1103-48307.xml	1914

sm_55428_1104-10249.xml	1914

sm_55428_1104-1054.xml	1914

sm_55428_1104-10674.xml	1914

sm_55428_1104-11261.xml	1914

sm_55428_1104-12525.xml	1914

sm_55428_1104-16000.xml	1914

sm_55428_1104-16676.xml	1914

sm_55428_1104-1729.xml	1914

sm_55428_1104-17386.xml	1914

sm_55428_1104-17578.xml	1914

sm_55428_1104-17920.xml	1914

sm_55428_1104-19641.xml	1914

sm_55428_1104-23173.xml	1914

sm_55428_1104-23241.xml	1914

sm_55428_1104-24118.xml	1914

sm_55428_1104-24362.xml	1914

sm_55428_1104-25164.xml	1914

sm_55

In [18]:
# for line in open("1004-1119.txt"):
#     try:
#         name, date1, date2 = line.split("\t")
#     except ValueError:
#         print(line)
#     for month in months:
#         if month in date1:
#             try:
#                 if date1.split(", ")[1].isdigit():
#                     year = date1.split(", ")[1]
#             except IndexError:
#                 continue
#             break
#         if month in date2:
#             try:
#                 if date2.split(", ")[1].isdigit():
#                     year = date2.split(", ")[1]
#             except IndexError:
#                 continue
#             break
#     if int(year) in years:
#         years[int(year)] += 1
#     else:
#         years[int(year)] = 1
        
#     print(name + "\t" + year)

In [19]:
years

{'1914': 590,
 'Military Journal Says. MEXICO CITY A VOLCANO Fear Expressed That Outbreak There May Make Intervention Necessary Any Time. EXPLAINED AS ROUTINE War Department Denies That Vera Cruz Assignments Are Significant ;- Chiefs to Rule Mexico.': 2,
 '1915': 2,
 'Poldhu': 1,
 'Killing a Score and Setting Fires. PLAN A VIGOROUS DEFENSE British Force May Be Aiding Belgians ;- Thousands More Flee to Holland and England.': 1,
 'With an Introduction by Field Marshal French': 1,
 'Says Athens ;- May Plan to Force Passage.': 2}
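
The non-year keys in the counts above appear because `split(", ")[1]` is applied to a headline that happens to contain a month name (for example "May Plan to Force Passage"). In addition, `year` keeps its value from the previous file whenever no month matches. The sketch below is a stricter extraction based on a regular expression. `DATE_PATTERN` and `extract_year` are illustrative names, and the pattern assumes dates are written like "Sep 6, 1914", as in the listing above.

```python
import re
from collections import Counter

# Matches dates such as "Sep 6, 1914"; assumes the three-letter month
# abbreviations used in the article listing.
DATE_PATTERN = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, (\d{4})\b"
)


def extract_year(*fields):
    """Return the year of the first field containing a full date, else None."""
    for field in fields:
        match = DATE_PATTERN.search(field or "")
        if match:
            return match.group(1)
    return None


# Headline/date pairs copied from the article listing.
rows = [
    ("Army Representatives Inspect Animals in Kansas City.", "Sep 6, 1914"),
    ("Dec 22, 1914", "19141222"),
    ("Begins Bombardment of Interior Forts, Says Athens ;- May Plan to Force Passage.",
     "Dec 21, 1914"),
]
year_counts = Counter(extract_year(title, date) for title, date in rows)
print(year_counts)
```

Returning `None` when no date is found makes unparsed files visible in the counts instead of silently reusing the previous year.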

In [20]:
# sorted keys
for key, value in sorted(years.items(), key=lambda x: x[0]): 
    print("{} : {}".format(key, value))

1914 : 590
1915 : 2
Killing a Score and Setting Fires. PLAN A VIGOROUS DEFENSE British Force May Be Aiding Belgians ;- Thousands More Flee to Holland and England. : 1
Military Journal Says. MEXICO CITY A VOLCANO Fear Expressed That Outbreak There May Make Intervention Necessary Any Time. EXPLAINED AS ROUTINE War Department Denies That Vera Cruz Assignments Are Significant ;- Chiefs to Rule Mexico. : 2
Poldhu : 1
Says Athens ;- May Plan to Force Passage. : 2
With an Introduction by Field Marshal French : 1


In [21]:
# sorted values
for key, value in sorted(years.items(), key=lambda x: x[1]): 
    print("{} : {}".format(key, value))

Poldhu : 1
Killing a Score and Setting Fires. PLAN A VIGOROUS DEFENSE British Force May Be Aiding Belgians ;- Thousands More Flee to Holland and England. : 1
With an Introduction by Field Marshal French : 1
Military Journal Says. MEXICO CITY A VOLCANO Fear Expressed That Outbreak There May Make Intervention Necessary Any Time. EXPLAINED AS ROUTINE War Department Denies That Vera Cruz Assignments Are Significant ;- Chiefs to Rule Mexico. : 2
1915 : 2
Says Athens ;- May Plan to Force Passage. : 2
1914 : 590


In [22]:
t=0
f=0

for file in files:
    if is_censorship(files[file][0]):
        
        t+=1
    else:
        f+=1
print("true = ", t, "false = ", f)

true =  599 false =  0


In [23]:
# is_censorship(files[""])

In [46]:
len(files)

599

In [25]:
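# spaCy 3 removed shortcut names such as "en"; there, load a full model name like "en_core_web_sm" instead
nlp = spacy.load("en")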

In [26]:
# HTML-entity and OCR debris to filter out; these are substring tests, so "tho" also drops words such as "those" and "author"
noise = ["&", ";", "$", " ", "apos", "quot", "nos", "tlhe", "tlie", "tihe", "thie", "andl", "tile", "tho"]
for file in files:
    article = []
    doc = nlp(files[file][0].lower())
    for w in doc:
        # keep a token only if it is not a stop word, punctuation mark or number,
        # is longer than two characters, and contains none of the noise fragments
        if not w.is_stop and not w.is_punct and not w.like_num and len(w.text) > 2 and not any(n in w.text for n in noise):
            # we add the lemmatized version of the word
            article.append(w.lemma_)
    files[file].append(article)

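Now that we have our dirty corpus, let us clean it.

## Clean Corpus

We clean it by removing stop words, punctuation, numbers, and spaces, and by lemmatizing the remaining words. The filtering loop above stores the resulting token list as the fourth element of each entry in `files`.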
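
One thing to watch in the filter condition of the cell above: the OCR fragments ("tlhe", "tho", "nos" and so on) are tested with `in`, which is a substring test, so ordinary words that contain a fragment are dropped as well. The sketch below contrasts that with whole-token matching. `NOISE`, `dropped_by_substring` and `dropped_by_token` are illustrative names, and which behaviour is preferable depends on how noisy the OCR is.

```python
# Fragments listed in the filter condition of the notebook.
NOISE = ("apos", "quot", "nos", "tlhe", "tlie", "tihe", "thie", "andl", "tile", "tho")


def dropped_by_substring(token):
    """Mirror the notebook's test: drop a token containing any fragment."""
    return any(fragment in token for fragment in NOISE)


def dropped_by_token(token):
    """Alternative: drop a token only when it equals a fragment exactly."""
    return token in NOISE


for token in ("tlhe", "author", "hostile", "censor"):
    print(token, dropped_by_substring(token), dropped_by_token(token))
```

With substring matching, "author" and "hostile" are removed along with the genuine OCR error "tlhe"; with whole-token matching only "tlhe" is removed.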

In [28]:
# iterate through corpus, clean code

In [29]:
from gensim.corpora import Dictionary
import gensim
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel

In [30]:
cleaned_texts = []

In [31]:
for file in files:
    cleaned_texts.append(files[file][3])

In [32]:
bigram = gensim.models.Phrases(cleaned_texts)

In [33]:
texts = [bigram[line] for line in cleaned_texts]



In [34]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

In [44]:
ldamodel = LdaModel(corpus=corpus, num_topics=7, id2word=dictionary, passes=10)

In [45]:
ldamodel.show_topics()

[(0,
  '0.010*"station" + 0.008*"government" + 0.007*"wireless" + 0.007*"message" + 0.006*"say" + 0.006*"german" + 0.004*"company" + 0.004*"ship" + 0.004*"united_state" + 0.004*"war"'),
 (1,
  '0.005*"say" + 0.004*"state" + 0.003*"day" + 0.003*"war" + 0.003*"court" + 0.003*"time" + 0.003*"man" + 0.003*"right" + 0.003*"law" + 0.002*"good"'),
 (2,
  '0.011*"west" + 0.010*"room" + 0.008*"time" + 0.007*"reference" + 0.005*"good" + 0.005*"man" + 0.004*"bath" + 0.004*"east" + 0.004*"large" + 0.004*"telephone"'),
 (3,
  '0.008*"german" + 0.008*"war" + 0.005*"british" + 0.005*"say" + 0.005*"censor" + 0.004*"news" + 0.004*"man" + 0.004*"government" + 0.003*"country" + 0.003*"cable"'),
 (4,
  '0.007*"play" + 0.004*"war" + 0.004*"good" + 0.004*"theatre" + 0.004*"say" + 0.003*"man" + 0.003*"great" + 0.003*"new" + 0.003*"know" + 0.003*"censorship"'),
 (5,
  '0.012*"german" + 0.006*"army" + 0.006*"man" + 0.006*"force" + 0.004*"war" + 0.004*"say" + 0.004*"line" + 0.004*"great" + 0.004*"report" + 0.00

In [37]:
# list of stop words, following restriction of punctuation, spaces;
# apos quot nos tlhe tlie tihe thie andl tile tho
