## Topic Modelling Newspapers

We are going to use this notebook to topic model our newspaper corpus. We start by setting up our imports.

In [1]:
import gensim



In [2]:
import spacy
import os

In [3]:
import xml.etree.ElementTree as ET

### Accessing XML files

In the following cells, we assemble the important information from the XML files. The code iterates through every XML file and parses it, but we only add an article to our corpus if it is censorship related. The following function identifies this.

In [4]:
# corpus = {}

In [5]:
def is_censorship(text):
    # Finds out whether an article is censorship related. For each additional word we wish
    # to add to constrain our group, we append 'or "word" in text' to the return expression below.
    # Note that "censor" already matches "censorship", and that the match is case-sensitive.
    return "censor" in text or "censorship" in text
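
The comment in `is_censorship` suggests growing the condition one `or` clause at a time. A minimal sketch of an alternative is shown here, assuming a keyword list is easier to maintain. The helper name `contains_any` and the keyword list are hypothetical and not part of the notebook. Unlike the original, this version lowercases the text and tolerates `None`, so it would not select exactly the same articles.

```python
# Hypothetical helper; the notebook itself hard-codes its keywords in is_censorship.
CENSORSHIP_KEYWORDS = ["censor", "suppress"]  # illustrative list, not the notebook's


def contains_any(text, keywords=CENSORSHIP_KEYWORDS):
    """Return True if any keyword occurs in text, ignoring case."""
    if text is None:  # an empty XML element has .text set to None
        return False
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


print(contains_any("The Censor Bans a Film"))
print(contains_any("Weather report"))
print(contains_any(None))
```

Because the check is a substring match, `"censor"` also covers "censorship" and "censored", so the list only needs one entry per word stem.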

In [6]:
def is_year(text):
    years = ["1917"]
    for year in years:
        if year in text:
            return True
    return False

In [7]:
is_censorship("censor censorship suppress ban") # Here we can display individual texts in order to see if they fit within our 'is_censorship' group as defined above

True

In [8]:
texts = []  # We are creating a list called "texts"

In [9]:
# i = 0 # Here we include the entire NYT database in our corpus
# files = {}
# for folder in os.listdir("NYT"):
#     for filename in os.listdir("NYT/" + folder):
#         if filename.endswith(".xml"):
#             tree = ET.parse("NYT/" + folder + "/" +filename)
#             root = tree.getroot()
#             try:
#                 if is_censorship(root[-1].text):
#                     files[filename] = []
#                     files[filename].append(root[-1].text)
#                     files[filename].append(root[3].text)
#                     files[filename].append(root[4].text)
#                     # add it to corpus
#             except IndexError:
#                 continue

In [10]:
i = 0  # counter used only by the optional early-exit code commented out below
files = {}
# Here we selectively add folders to our corpus
folder_ids = list(range(1097, 1107)) + list(range(1109, 1115)) + list(range(1116, 1121))
folders = ["NYT/sm_55428_{}/".format(n) for n in folder_ids]
for folder in folders:
    for filename in os.listdir(folder):
        if filename.endswith(".xml"):
            tree = ET.parse(folder + filename)
            root = tree.getroot()
            try:
                # '-1' signifies the last element in the .xml file, which is <FullText>;
                # elements 3 and 4 are checked for the year and kept for the listing below
                if is_censorship(root[-1].text) and (is_year(root[3].text) or is_year(root[4].text)):
                    # add it to corpus
                    files[filename] = []
                    files[filename].append(root[-1].text)
                    files[filename].append(root[3].text)
                    files[filename].append(root[4].text)
            except IndexError:
                continue
#             i += 1 # for this version we only run 10000 iterations and break after th
#         if i == 10004:
#             break
#     if i == 10004:
#         break    
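
The loop above reads fields by position (`root[3]`, `root[4]`, `root[-1]`), which depends on every record listing its elements in the same order. The following is a sketch of looking fields up by tag name instead. Only the `FullText` tag is named in the notebook; `Record`, `Title`, `Date` and `Abstract` are hypothetical names used for illustration, and the record is built in memory rather than read from the NYT folders.

```python
import xml.etree.ElementTree as ET

# Build a small stand-in record. Only "FullText" comes from the notebook;
# the other tag names are assumptions made for this example.
root = ET.Element("Record")
for tag, text in [
    ("Title", "Censor Halts Dispatch"),
    ("Date", "May 5, 1917"),
    ("FullText", "The censor stopped the cable."),
]:
    ET.SubElement(root, tag).text = text
ET.SubElement(root, "Abstract")  # present but empty, so its .text is None


def field(record, tag):
    """Return the text of the first child named tag, or '' if missing or empty."""
    element = record.find(tag)
    if element is None or element.text is None:
        return ""
    return element.text


article = {
    "text": field(root, "FullText"),
    "title": field(root, "Title"),
    "date": field(root, "Date"),
}
print(article)
print(repr(field(root, "Abstract")), repr(field(root, "Missing")))
```

Returning an empty string for missing or empty elements also means a keyword test such as `"censor" in text` never receives `None`.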

In [11]:
# i = 0 # Here we selectively add folders to our corpus
# folders = ["NYT/sm_55428_1004/","NYT/sm_55428_1005/"]
# for folder in folders:
#     for filename in os.listdir(folder):
#         if filename.endswith(".txt"):
#             files[filename] = []
#             try:
#                 if is_censorship(filename):
#                     file = open(filename, "r") 
#                     files[filename].append(file.read())
#                 # add it to corpus; '-1' signifies the last element in the .xml file, which is <FullText>
#             except IndexError:
#                 continue
#     i += 1 # for this version we only run 10000 iterations and break after th
#     if i == 10004:
#         break

In [12]:
# os.listdir("NYT/sm_55428_1004/") # to view lists

In [13]:
for file in files: # Here we can see the list of articles that are related to censorship and will be topic modeled
    print(file + "\t" + files[file][1] + "\t" + files[file][2])

sm_55428_1097-3368.xml	Herr Dove, Back from the Front, Thus Reports to Reichstag.	Oct 11, 1917
sm_55428_1097-35967.xml	All Saved from Burning Ship by Cruiser, Except One Gunner.	May 11, 1917
sm_55428_1097-3881.xml	German Official Says the Case Is "Rather Complicated," Nevertheless Clear." PARTY ARRIVES AT ZURICH Numbers Forty-six, Including Families--Some of Them MakeBitter Complaints. Some Awaiting Transfers. Consuls Arrive At Zurich. WANTS GERARD'S RECORD. Newspaper Demands Light on Charges Against Diplomat.	Feb 22, 1917
sm_55428_1097-47088.xml	Apr 1, 1917	19170401
sm_55428_1098-28817.xml	May 19, 1917	19170519
sm_55428_1099-17329.xml	Representative Britten Wants Balfour to Suspend It Immediately.	May 5, 1917
sm_55428_1099-21600.xml	Jan 18, 1917	19170118
sm_55428_1099-32380.xml	The Author of "Nju" Describes the Playwright's Perplexities in the Russia That Was.	Apr 1, 1917
sm_55428_1099-46988.xml	Maps, Plans, and Data of Muntion Plants, Navy Yard, and Armories Seized.ORDERED FROM WASHI

sm_55428_1103-5927.xml	Believes Rushing of a Small Force to France Would Sacrifice Men Needlessly. WOULD HINDER PLANS HERE A Similar French Proposal to the War Department Admitted Need of Men. MUST FIGHT UNDER OUR FLAG Army Officers Think the Country Would Not Tolerate Any Other Arrangement. Proposal Made Formally. Proposal Emphatically Opposed.	May 17, 1917
sm_55428_1103-691.xml	Jul 3, 1917	19170703
sm_55428_1103-9245.xml	Comment on the System Which Gives Germany News Barred Here. Sanity of American Press. Our Disorganized Censorship.	Jul 4, 1917
sm_55428_1103-9318.xml	Jul 3, 1917	19170703
sm_55428_1104-1674.xml	One, Known to Secret Service, Escaped from England on a Friend's Passport. SEEN AT A LOCAL HOTEL Demand for a Real Censorship of Outgoing Cables May Bring Action. Spy Run Out of England. SAY GERMAN SPIES ARE HERE UNLISTED Spies Working for Government. "Lookouts" at Ellis Island. Relayed in Spain. May Tighten Cable Censorship.	Jul 5, 1917
sm_55428_1104-21808.xml	May 3, 1917	191

sm_55428_1106-38068.xml	War Department to Send Nine Regiments at the "Earliest Possible Moment." NEW YORK UNIT INCLUDED But Medical Forces, Now Ready to Sail, Will First Fly Our Flag Abroad. To Go to France Soon. 10,000 ENGINEERS TO GO TO FRANCE Great Problem at Front. OFFERS 3,000 VOLUNTEERS. T.C. Desmond Ready to Supply Skilled Engineering Force. STEPS TO ENLIST MEN HERE. Local Engineers Will Open a Recruiting Station Tomorrow.	May 8, 1917
sm_55428_1106-40272.xml	Jan 3, 1917	19170103
sm_55428_1106-42699.xml	British Express Astonishment That America Got the Luxburg Messages. RECALL ZIMMERMANN NOTE Dispatch of Cipher Telegrams from England to Sweden May Now Be Prohibited. Sweden's Leanings Known. American Diplomatic Skill. DIPLOMATIC COUP SURPRISES LONDON Policy Known to Neutrals. Swedish Election Pending. Denounce Germany's Action. Russian Revolution and Sweden.	Sep 10, 1917
sm_55428_1106-46862.xml	Staats-Zeitung Can't "Countenance Any League to Invade" America. PRO-GERMANS ARE AMAZED

sm_55428_1109-31679.xml	Committee Recommends Seizure and Operation of the Entire industry at Once TO SAFEGUARD FREE PRESS Declares Greed for Excessive Profits Is Imposing a Most Unjust Burden. FINDS PRODUCERS DEFIANT Pooling and Distribution Plan Advocated as a Means of Immediate Relief. The Proposed Resolution. Need Relief at Once. SENATORS URGE CONTROL OF PAPER Small Papers Distressed. 16,000 Papers Affected. Big Profits for Mills. No Censorship Fear. Gave Some Relief.	Oct 8, 1917
sm_55428_1109-31724.xml	Jan 15, 1917	19170115
sm_55428_1109-33399.xml	Administration Willing to Give Further Trial to Plan of Voluntary Restriction.	May 17, 1917
sm_55428_1109-34190.xml	Overman Accepts Substitute Deemed a Long Step Toward Settling the Controversy. VOTE MAY BE TAKEN TODAY Prohibition of Information About Fortifications and Military and Naval Forces.	May 11, 1917
sm_55428_1109-34195.xml	EMPEROR SEES NEUTRALS Envoys in Berlin Called to Conference but Subject Is Hidden. BETHMANN POSITION SHAKY 

sm_55428_1110-18251.xml	Entrance of the United States Into the War, the Withdrawal of Russia, and the Groping for Peace Have Discounted the Year's Fighting THE RUSSIAN REVOLUTION. BELLIGERENTS AND NEUTRALS IN YEAR'S ACTIVITIES THE ITALIAN FRONT. THE EASTERN FRONT. BEYOND EUROPE. ADVENTURE AT ADEN. NAVAL WARFARE. IN POLITICS AND CIVIL LIFE. WAR IN THE AIR. VITAL STATISTICS.	Dec 30, 1917
sm_55428_1110-19041.xml	Hints of Transports' Sailing May Have Given Tip to Spies.	Jul 4, 1917
sm_55428_1110-19826.xml	Debate Will Centre on Motion to Eliminate Section;-A Possible Official Defense.	May 8, 1917
sm_55428_1110-21282.xml	Russian Papers Allowed by the Censor to Publish Various Versions.	Jan 5, 1917
sm_55428_1110-23095.xml	May 19, 1917	19170519
sm_55428_1110-28145.xml	Motion Picture Art League Will Strive to Make Rule Unnecessary.	Jan 11, 1917
sm_55428_1110-28167.xml	State Department Advised That Hindenburg and Ludendorff Control Situation.	Jul 18, 1917
sm_55428_1110-29598.xml	Evidence of Nati

sm_55428_1111-5876.xml	Organized Traffic to South America Charged--Other Cocchi Victims Sought.	Jun 23, 1917
sm_55428_1111-5976.xml	Jul 6, 1917	19170706
sm_55428_1111-7229.xml	Jan 15, 1917	19170115
sm_55428_1111-8243.xml	Ambassador Back from Austria After Passing Zone of Big Raid in Bay of Biscay. TELLS OF VIENNA AGITATION Press Campaign of Two Years Against America;-Submarine Attack on Rochambeau Confirmed.	May 17, 1917
sm_55428_1111-8846.xml	Jan 15, 1917	19170115
sm_55428_1111-8928.xml	Bethlehem and the Shell Order. Sea Raid's Effect on Stocks. Suppressing Commodity Quotations. Anglo-French Bonds and Prices. The A.M.C. Investment. Penalty for Reserve Deficiencies.	Jan 18, 1917
sm_55428_1111-9519.xml	Germany Makes Agreement With Britain That Is of interest to Americans at This Time. TO REPORT ALL CAPTURES And Prisoners to be Allowed to Communicate with Families-- To Give Notice of Reprisals. Notification of Captures. Cannot Be Employed at Front. Punishments to be Remitted.	Sep 30, 191

sm_55428_1113-3277.xml	Representative Kahn Would So Amend Espionage Bill;-Senate Again Takes Up Measure.	May 2, 1917
sm_55428_1113-36948.xml	Restored in New Form After Some Who Had Voted Against It Had Left the Floor. ACTION IS CONDEMNED As Violating "Gentlemen's Agreement" Which Covered First Vote. JURY TRIAL PROVIDED FOR President to Proclaim the Character of Information That May Be Useful to the Enemy. Text of Gard Substitute. Raises Question of Intent. CENSORSHIP PUT THROUGH THE HOUSE Amendments Knocked Out. EMBARGO CLAUSE CHANGED. Senate Adopts a Substitute Restricting President's Power.	May 5, 1917
sm_55428_1113-37342.xml	ACTED WHEN TALK FAILED Troops Swept Forward After War Minister Into Foe's Trenches. DEED THRILLS ALL RUSSIA Teutons Estimate from Sixteento Twenty Divisions Attacking in Southwest. BATTLE SPREADING NORTH Commander on West Front Callson His Armies to Join in Fight Decisive for Nation's Liberty. Army Still Advancing. Official Account of Attack. Germans Admit Defea

sm_55428_1116-17478.xml	John Collier Points How Good May Follow the Failure of Fusion. HUGE WAR CHEST USELESS The Voters, He Adds, "Have Started in to Haul Down the Uplifters."	Dec 3, 1917
sm_55428_1116-19048.xml	House Conferees Continue to Insist Upon Newspaper Censorship.	May 22, 1917
sm_55428_1116-21092.xml	Lord Northcliffe Predicts Close Bond of Germany's Foes as Insurance Against Autocracy. PUTS FAITH IN AIRPLANES British Commissioner Hopes Newspapers and Magazines May Be Allowed to Deal Frankly with War. Lord Northcliffe's Speech. Hope Lies in the Airplane.	Jun 29, 1917
sm_55428_1116-21117.xml	Professor Charles A. Beard Says Narrow Clique Is Controlling the University. FREE SPEECH THE ISSUE Resignation Grows Out of Expulsion of Professors Cattell and Dana. No Question of Pro-Germanism. QUITS COLUMBIA; ASSAILS TRUSTEES	Oct 9, 1917
sm_55428_1116-21259.xml	British Official Says He Misrepresented American Naval Views.	Oct 20, 1917
sm_55428_1116-21486.xml	Would Not Serve Interests of 

sm_55428_1117-19939.xml	Had Sent to a Friend "Titbits of Information";-Not an Act of Treason.	May 16, 1917
sm_55428_1117-21476.xml	First English Version of Former Minister's Speech at Havre.	Jun 9, 1917
sm_55428_1117-23615.xml	First Day's Supervision Reveals Only a Few Messages Not Clearly Expressed.	May 5, 1917
sm_55428_1117-27459.xml	British Retired Officer Reports Part of the German Losses.	Mar 11, 1917
sm_55428_1117-31891.xml	Krylenko's Report?	Dec 7, 1917
sm_55428_1117-32109.xml	War Department Chiefs Aim to Prevent Sending of Military Information. MEN'S STATIONS CONCEALED Efforts Will Be Made to Facilitate Correspondence That Is Within the Rules. Postage at Domestic Rates. Mail Service for Soldiers. Specimen Cable Messages.	Jul 10, 1917
sm_55428_1117-33183.xml	Lax Censorship of Outgoing Cables May Have Contributed;- Publication of Troop Movements Here and Activity of a Red Cross Agent. HOW DID GERMANY GET ADVANCE NEWS?	Jul 4, 1917
sm_55428_1117-34511.xml	More Detailed Reports of D

sm_55428_1118-34976.xml	London Diplomat Thinks Government Is Suppressing Anti-U-Boat Sentiment. PRAISES WILSON'S PLAN He Looks for Great Results from the President's Appeal to Neutral Nations. Two Courses Open to Germany. Sees Divisions Developing in Germany.	Feb 6, 1917
sm_55428_1118-37626.xml	Will Silk Replace Cotton? Steel Price Question Affects Trading. Decline of the Reichsmark. A Paper Burden. Motor Dividend Action Put Off. Value of a Dollar.	Jun 20, 1917
sm_55428_1118-38233.xml	Suffrage a State Right. Slow Snow Removal. A Test of Organization Met. Home Care to Combat Illness.	Dec 22, 1917
sm_55428_1118-41995.xml	Trustees Want to Know if Any Professors Are Propounding Unpatriotic Doctrines. AIMED AT THE PACIFISTS Board Also Adopts Resolution Pledging Support of the University to the Nation for Defense. BUTLER ANSWERS SPECTATOR. Denies to College Paper That Faculty Restrains Free Speech. AMERICAN SHIPOWNERS WAIT. Ready to Mount Guns if Navy Department Supplies Them.	Mar 6, 1917
sm

sm_55428_1119-26295.xml	Washington Aroused by Publication of Arrival of Forces in France. GOVERNORS GOT MESSAGES Colonel and Lisutenant Colonel as Well as Censors Must Explain. OFFICIAL REQUEST IGNORED Dispatches Heralded in Many Newspapers Despite Department's Pica. Predicts Drastic Action. TO TRY OFFICERS GIVING TROOP NEWS Why News Is Kept Secret.	Oct 15, 1917
sm_55428_1119-26679.xml	War Department Chiefs Aim to Prevent Sending of Military Information. MEN'S STATIONS CONCEALED Efforts Will Be Made to Facilitate Correspondence That Is Within the Rules. Postage at Domestic Rates. Mail Service for Soldiers. Specimen Cable Messages.	Jul 10, 1917
sm_55428_1119-27224.xml	London Diplomat Thinks Government Is Suppressing Anti-U-Boat Sentiment. PRAISES WILSON'S PLAN He Looks for Great Results from the President's Appeal to Neutral Nations. Two Courses Open to Germany. Sees Divisions Developing in Germany.	Feb 6, 1917
sm_55428_1119-32246.xml	Washington Starts Inquiry Into Cables Telling of Int

sm_55428_1120-17432.xml	Broker Who Cornered Coffee Market in 1907 Had Been Living in Germany During War.	Nov 23, 1917
sm_55428_1120-19198.xml	May 26, 1917	19170526
sm_55428_1120-19839.xml	Washington Aroused by Publication of Arrival of Forces in France. GOVERNORS GOT MESSAGES Colonel and Lisutenant Colonel as Well as Censors Must Explain. OFFICIAL REQUEST IGNORED Dispatches Heralded in Many Newspapers Despite Department's Pica. Predicts Drastic Action. TO TRY OFFICERS GIVING TROOP NEWS Why News Is Kept Secret.	Oct 15, 1917
sm_55428_1120-19956.xml	Hohenzollerns and Hapsburgs Will Follow Romanoffs, He Declares. MASS MEETING FOR RUSSIA Cable Message from President Lvoff Cheered by 1,500 Celebrating Success of the Rovolution. Message from Roosevelt. Why Grand Duke Was Removed.	Mar 26, 1917
sm_55428_1120-20014.xml	"Impulsive Statements" Criticised at Session of Reichstag Main Committee.SPONSORED BY MICHAELISKuehimann Asserts Chancellor HasRepresentative at Headquarters and Is Responsible.	S

sm_55428_1120-42302.xml	Full Text Was Not Issued by the London Press Bureau Until 9 o'Clock Wednesday Evening. COMMENT GIVEN OUT FIRST Paralysis of Newspaper Initiative Laid to the Censorship "Spoonfeeding" Methods. Censorship Lessens Effort. Editorial Comment Rushed.	Sep 1, 1917
sm_55428_1120-42544.xml	Livelier Artillery Fighting in Their Sector Also Causes Shrapnel Wounds. COMMENDED FOR BRAVERY Officers and Others Cited by a French General--Note on Trench Raid. SHELLS KILL MORE OF PERSHING MEN	Nov 18, 1917
sm_55428_1120-42697.xml	With Artists and Composers, Send Greetings to Writers Who Aided in Revolution. ELIHU ROOT JOINS IN LETTER President Butler Calls the Uprising a "Product of Philosophy and Letters."	Apr 24, 1917
sm_55428_1120-42711.xml	Described in Letters to His Chief in Washington, Now First Published. 90,000 WORKERS DEPORTED Exiles Underfed and Unpaid;-235 Said to Have Died in Ghent;-A German "Sea of Hate." An Efftetive Allied Spy System. "The Nightmare of the Deportations

In [14]:
len(files)

493

In [15]:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

In [16]:
years = {}

In [17]:
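for file in files:
    # NOTE: a month abbreviation can also occur inside a headline (e.g. "May" or "Mar"),
    # in which case the text after the first ", " is a headline fragment rather than a year.
    # If no month is found at all, `year` keeps its value from the previous file.
    for month in months:
        try:
            if month in files[file][1]:
                year = files[file][1].split(", ")[1]
                break
            if month in files[file][2]:
                year = files[file][2].split(", ")[1]
                break
        except IndexError:
            continue
    if year in years:
        years[year] += 1
    else:
        years[year] = 1
    print(file + "\t" + year + "\n")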

sm_55428_1097-3368.xml	1917

sm_55428_1097-35967.xml	1917

sm_55428_1097-3881.xml	1917

sm_55428_1097-47088.xml	1917

sm_55428_1098-28817.xml	1917

sm_55428_1099-17329.xml	1917

sm_55428_1099-21600.xml	1917

sm_55428_1099-32380.xml	1917

sm_55428_1099-46988.xml	1917

sm_55428_1100-11699.xml	1917

sm_55428_1100-12241.xml	1917

sm_55428_1100-19507.xml	1917

sm_55428_1100-21148.xml	1917

sm_55428_1100-25343.xml	1917

sm_55428_1100-30866.xml	1917

sm_55428_1100-31593.xml	1917

sm_55428_1100-32458.xml	1917

sm_55428_1100-33107.xml	1917

sm_55428_1100-7976.xml	1917

sm_55428_1100-8059.xml	Known to Secret Service

sm_55428_1101-10611.xml	1917

sm_55428_1101-30183.xml	1917

sm_55428_1101-6088.xml	1917

sm_55428_1101-8868.xml	1917

sm_55428_1101-8950.xml	1917

sm_55428_1102-17316.xml	1917

sm_55428_1102-22421.xml	1917

sm_55428_1102-22850.xml	1917

sm_55428_1102-30006.xml	1917

sm_55428_1102-33462.xml	1917

sm_55428_1102-40318.xml	1917

sm_55428_1102-4535.xml	1917

sm_55428_1103-10227.xml	1917



sm_55428_1114-26403.xml	1917

sm_55428_1114-29902.xml	1917

sm_55428_1114-3182.xml	1917

sm_55428_1114-32761.xml	1917

sm_55428_1114-32976.xml	1917

sm_55428_1114-33035.xml	1917

sm_55428_1114-3327.xml	1917

sm_55428_1114-34079.xml	1917

sm_55428_1114-34743.xml	1917

sm_55428_1114-41745.xml	1917

sm_55428_1114-42202.xml	1917

sm_55428_1114-42356.xml	1917

sm_55428_1114-49207.xml	1917

sm_55428_1114-49222.xml	1917

sm_55428_1114-49233.xml	1917

sm_55428_1114-49851.xml	1917

sm_55428_1114-628.xml	1917

sm_55428_1114-7196.xml	1917

sm_55428_1114-8265.xml	1917

sm_55428_1114-8974.xml	1917

sm_55428_1116-12764.xml	1917

sm_55428_1116-1304.xml	1917

sm_55428_1116-17478.xml	He Adds

sm_55428_1116-19048.xml	1917

sm_55428_1116-21092.xml	1917

sm_55428_1116-21117.xml	1917

sm_55428_1116-21259.xml	1917

sm_55428_1116-21486.xml	1917

sm_55428_1116-21642.xml	Heeding President

sm_55428_1116-23767.xml	1917

sm_55428_1116-26313.xml	1917

sm_55428_1116-32587.xml	1917

sm_55428_1116-33759.xml	1917

s

In [18]:
# for line in open("1004-1119.txt"):
#     try:
#         name, date1, date2 = line.split("\t")
#     except ValueError:
#         print(line)
#     for month in months:
#         if month in date1:
#             try:
#                 if date1.split(", ")[1].isdigit():
#                     year = date1.split(", ")[1]
#             except IndexError:
#                 continue
#             break
#         if month in date2:
#             try:
#                 if date2.split(", ")[1].isdigit():
#                     year = date2.split(", ")[1]
#             except IndexError:
#                 continue
#             break
#     if int(year) in years:
#         years[int(year)] += 1
#     else:
#         years[int(year)] = 1
        
#     print(name + "\t" + year)

In [19]:
years

{'1917': 485,
 'Known to Secret Service': 2,
 "Merchants' Association Hears. URGED TO PREPARE AT ONCE We Must Act as If the Struggle Was Single-Handed": 1,
 'He Tells Phi Gamma Deltas. WAR IS AGAINST MISRULE It May Be Long': 1,
 'He Adds': 1,
 'Heeding President': 1,
 'Quick Information on Any Marine Losses.': 1,
 'Says Marshall. THINKS CHANGE PERMANENT Revolution Too Deep-Seated to Yield to Reaction;-Movement May Spread to Germany. Russian Jews Loyal Will Help Develop Russia.': 1}
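
The non-year keys in this dictionary are headline fragments: a month abbreviation such as "May" matched inside a headline, and the text after the first comma was taken as the year. A minimal sketch of a stricter extraction follows, assuming dates always look like "Jul 5, 1917". The names `DATE_PATTERN` and `extract_year` are hypothetical, and the rows are small stand-ins (the first headline is abridged from the listing above), not the real corpus.

```python
import re
from collections import Counter

# Month abbreviation, day, comma, four-digit year, e.g. "Jul 5, 1917".
DATE_PATTERN = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, (\d{4})\b"
)


def extract_year(*fields):
    """Return the year from the first field that contains a date, else None."""
    for value in fields:
        match = DATE_PATTERN.search(value or "")
        if match:
            return match.group(1)
    return None


# Stand-in rows in the same (second column, third column) layout as the listing.
rows = [
    ("One, Known to Secret Service, Escaped. May Tighten Cable Censorship.", "Jul 5, 1917"),
    ("Apr 1, 1917", "19170401"),
    ("No date here", ""),
]
year_counts = Counter(extract_year(first, second) for first, second in rows)
print(year_counts)
```

Rows without a recognisable date are counted under `None` instead of silently inheriting the previous row's year, which makes them easy to inspect afterwards.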

In [20]:
# sorted keys
for key, value in sorted(years.items(), key=lambda x: x[0]): 
    print("{} : {}".format(key, value))

1917 : 485
He Adds : 1
He Tells Phi Gamma Deltas. WAR IS AGAINST MISRULE It May Be Long : 1
Heeding President : 1
Known to Secret Service : 2
Merchants' Association Hears. URGED TO PREPARE AT ONCE We Must Act as If the Struggle Was Single-Handed : 1
Quick Information on Any Marine Losses. : 1
Says Marshall. THINKS CHANGE PERMANENT Revolution Too Deep-Seated to Yield to Reaction;-Movement May Spread to Germany. Russian Jews Loyal Will Help Develop Russia. : 1


In [21]:
# sorted values
for key, value in sorted(years.items(), key=lambda x: x[1]): 
    print("{} : {}".format(key, value))

Merchants' Association Hears. URGED TO PREPARE AT ONCE We Must Act as If the Struggle Was Single-Handed : 1
He Tells Phi Gamma Deltas. WAR IS AGAINST MISRULE It May Be Long : 1
He Adds : 1
Heeding President : 1
Quick Information on Any Marine Losses. : 1
Says Marshall. THINKS CHANGE PERMANENT Revolution Too Deep-Seated to Yield to Reaction;-Movement May Spread to Germany. Russian Jews Loyal Will Help Develop Russia. : 1
Known to Secret Service : 2
1917 : 485


In [22]:
t=0
f=0

for file in files:
    if is_censorship(files[file][0]):
        
        t+=1
    else:
        f+=1
print("true = ", t, "false = ", f)

true =  493 false =  0


In [23]:
# is_censorship(files[""])

In [40]:
len(files)

493

In [25]:
nlp = spacy.load("en")  # spaCy 2.x shortcut; spaCy 3+ needs a full model name such as "en_core_web_sm"

In [26]:
# We tokenize each article and keep the lemma of every token that passes the filters below.
# Any token containing one of these strings is dropped: XML entity remnants ("apos", "quot"),
# stray symbols, and common OCR misreadings. Matching is by substring, so ordinary words
# that contain one of them (e.g. "author" contains "tho") are dropped as well.
noise = ["&", ";", "$", " ", "apos", "quot", "nos", "tlhe", "tlie", "tihe", "thie", "andl", "tile", "tho"]
for file in files:
    article = []
    doc = nlp(files[file][0].lower())
    for w in doc:
        # if it's not a stop word, punctuation mark, number, short token or noisy token, add it to our article!
        if (not w.is_stop and not w.is_punct and not w.like_num
                and len(w.text) > 2
                and not any(s in w.text for s in noise)):
            # we add the lemmatized version of the word
            article.append(w.lemma_)
    files[file].append(article)

In [27]:
# stop word list changelog
# 12/2: added "apos" and "quot" as undesirable remnants of .xml punctuation codings
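
The filter above drops any token that *contains* one of the listed strings, not only tokens that equal one. The sketch below compares the two behaviours on a made-up word list. The filter strings are the ones from the notebook; the words and variable names are illustrative assumptions.

```python
# Strings taken from the notebook's filter; the word list is made up.
NOISE = ["apos", "quot", "nos", "tlhe", "tlie", "tihe", "thie", "andl", "tile", "tho"]
words = ["author", "diagnosis", "hostile", "quota", "tlhe", "censor"]

# Substring matching, as in the notebook: drop a word if it contains any noise string.
dropped_by_substring = [w for w in words if any(s in w for s in NOISE)]

# Exact matching: drop a word only if it is itself a noise string.
noise_set = set(NOISE)
dropped_by_exact = [w for w in words if w in noise_set]

print(dropped_by_substring)
print(dropped_by_exact)
```

Substring matching catches OCR variants glued to other characters, at the cost of losing some legitimate vocabulary. Which trade-off suits this corpus is a judgment call that the notebook does not settle.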

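Now that we have assembled our raw corpus, let us clean it.

## Clean Corpus

We clean the corpus by removing stop words, punctuation, numbers, and whitespace, and by lemmatizing the remaining words. The loop above has already stored the cleaned tokens as the fourth item of each entry in `files`; the cells below collect them and build the inputs for the topic model.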

In [28]:
# iterate through corpus, clean code

In [29]:
from gensim.corpora import Dictionary
import gensim
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel

In [30]:
cleaned_texts = []

In [31]:
for file in files:
    cleaned_texts.append(files[file][3])

In [32]:
bigram = gensim.models.Phrases(cleaned_texts)

In [33]:
texts = [bigram[line] for line in cleaned_texts]
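
To give a feel for what the bigram step does, here is a simplified, count-only stand-in written in plain Python. It is not gensim's algorithm: `gensim.models.Phrases` scores each adjacent pair against the frequencies of its parts and is controlled by its `min_count` and `threshold` parameters, whereas this sketch merges any adjacent pair seen often enough. The function name and the toy documents are assumptions for illustration.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least min_count times across docs with '_'."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and pair_counts[(doc[i], doc[i + 1])] >= min_count:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


toy_docs = [
    ["war", "department", "issue", "order"],
    ["war", "department", "deny", "report"],
    ["censor", "stop", "cable"],
]
merged = merge_frequent_bigrams(toy_docs)
print(merged)
```

In the merged output, "war" and "department" become the single token "war_department", which is the same underscore-joined form that appears in the topic model's vocabulary when a pair is detected.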



In [34]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
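
The `corpus` built here is a bag-of-words representation: each document becomes a list of `(token_id, count)` pairs. A plain-Python sketch of the same idea follows, assuming ids are handed out in order of first appearance. gensim's `Dictionary` may assign different ids, and the helper names are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in doc."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


toy_docs = [["war", "censor", "war"], ["censor", "cable"]]
vocab = build_vocabulary(toy_docs)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(bows)
```

Word order is discarded at this point, which is why the bigram step has to run before the dictionary is built.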

In [41]:
ldamodel = LdaModel(corpus=corpus, num_topics=7, id2word=dictionary, passes=10)

In [42]:
ldamodel.show_topics()

[(0,
  '0.012*"german" + 0.006*"war" + 0.005*"say" + 0.005*"government" + 0.004*"american" + 0.004*"germany" + 0.004*"man" + 0.003*"state" + 0.003*"enemy" + 0.003*"year"'),
 (1,
  '0.004*"censor" + 0.003*"work" + 0.003*"french" + 0.002*"message" + 0.002*"word" + 0.002*"cent" + 0.002*"code" + 0.002*"mail" + 0.002*"force" + 0.002*"american"'),
 (2,
  '0.009*"say" + 0.007*"war" + 0.006*"german" + 0.005*"germany" + 0.005*"government" + 0.005*"american" + 0.004*"country" + 0.003*"newspaper" + 0.003*"dispatch" + 0.003*"information"'),
 (3,
  '0.006*"war" + 0.004*"time" + 0.004*"say" + 0.003*"man" + 0.003*"people" + 0.003*"country" + 0.003*"german" + 0.003*"come" + 0.003*"germany" + 0.003*"great"'),
 (4,
  '0.007*"war" + 0.006*"censorship" + 0.006*"say" + 0.005*"newspaper" + 0.005*"government" + 0.004*"time" + 0.004*"country" + 0.004*"information" + 0.004*"president" + 0.003*"american"'),
 (5,
  '0.005*"say" + 0.005*"government" + 0.003*"censorship" + 0.003*"know" + 0.003*"time" + 0.003*"germ
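
The topic strings printed above are compact but hard to compare by eye. The sketch below parses one such string into `(word, weight)` pairs, assuming the `weight*"word"` format shown in the output. The names `TERM_PATTERN` and `parse_topic` are hypothetical, and the sample is the beginning of topic 0 copied from the output.

```python
import re

# One weight*"word" term inside a printed topic string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


# Beginning of topic 0 from the output above.
topic_0 = '0.012*"german" + 0.006*"war" + 0.005*"say" + 0.005*"government"'
pairs = parse_topic(topic_0)
for word, weight in pairs:
    print(f"{word:<12}{weight:.3f}")
```

When the model object is still in memory, gensim's `LdaModel.show_topic` method returns word and probability pairs directly, so parsing is mainly useful for output that was saved as text.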

In [37]:
# list of stop words, following restriction of punctuation, spaces;
# apos quot nos tlhe tlie tihe thie andl tile tho
