## Topic Modelling Newspapers

We are going to use this notebook to topic model our newspaper corpus. We start by setting up our imports.

In [1]:
import gensim



In [2]:
import spacy
import os

In [3]:
import xml.etree.ElementTree as ET

### Accessing XML files

In the following cells, we assemble the important information from the XML files. The code iterates through every XML file and parses it, but an article is added to our corpus only if it is censorship related. The `is_censorship` function below identifies this, and a second function, `is_year`, restricts the corpus to articles dated 1916.

In [4]:
# corpus = {}

In [5]:
def is_censorship(text):
    # Returns True if the text is censorship related. For each additional word we wish to add
    # to constrain our group, we add 'or "word" in text' to the return expression below.
    # Note: "censor" is a substring of "censorship", so the second test is redundant but harmless.
    return "censor" in text or "censorship" in text
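
The comment in `is_censorship` suggests extending the check by chaining more `or` clauses. A minimal sketch of an alternative, assuming the same case-sensitive substring matching, keeps the keywords in a tuple instead. The names `CENSORSHIP_KEYWORDS` and `matches_any` are hypothetical, and `"suppress"` is only an illustrative extra keyword, not one the notebook uses.

```python
# Hypothetical keyword tuple; "suppress" is an illustrative addition.
CENSORSHIP_KEYWORDS = ("censor", "suppress")


def matches_any(text, keywords=CENSORSHIP_KEYWORDS):
    """Return True if any keyword occurs as a substring of text (case-sensitive)."""
    if not text:  # guards against None, e.g. an XML element with no text
        return False
    return any(keyword in text for keyword in keywords)


print(matches_any("censor censorship suppress ban"))
print(matches_any("The weather was fine."))
```

Adding a keyword then means editing one tuple rather than the condition itself. The `None` guard matters because an empty XML element has `text` equal to `None`, which would otherwise raise a `TypeError` on the `in` test.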

In [6]:
def is_year(text):
    years = ["1916"]
    for year in years:
        if year in text:
            return True
    return False

In [7]:
is_censorship("censor censorship suppress ban") # Here we can display individual texts in order to see if they fit within our 'is_censorship' group as defined above

True

In [8]:
texts = [] # We are creating a list called "texts"

In [9]:
# i = 0 # Here we include the entire NYT database in our corpus
# files = {}
# for folder in os.listdir("NYT"):
#     for filename in os.listdir("NYT/" + folder):
#         if filename.endswith(".xml"):
#             tree = ET.parse("NYT/" + folder + "/" +filename)
#             root = tree.getroot()
#             try:
#                 if is_censorship(root[-1].text):
#                     files[filename] = []
#                     files[filename].append(root[-1].text)
#                     files[filename].append(root[3].text)
#                     files[filename].append(root[4].text)
#                     # add it to corpus
#             except IndexError:
#                 continue

In [10]:
i = 0 # Here we selectively add folders to our corpus
files = {}
folders = ["NYT/sm_55428_1097/", "NYT/sm_55428_1098/", "NYT/sm_55428_1099/", "NYT/sm_55428_1100/",  "NYT/sm_55428_1101/", "NYT/sm_55428_1102/", "NYT/sm_55428_1103/", "NYT/sm_55428_1104/", "NYT/sm_55428_1105/", "NYT/sm_55428_1106/", "NYT/sm_55428_1109/", "NYT/sm_55428_1110/", "NYT/sm_55428_1111/", "NYT/sm_55428_1112/", "NYT/sm_55428_1113/", "NYT/sm_55428_1114/", "NYT/sm_55428_1116/", "NYT/sm_55428_1117/", "NYT/sm_55428_1118/", "NYT/sm_55428_1119/", "NYT/sm_55428_1120/",]
for folder in folders:
    for filename in os.listdir(folder):
        if filename.endswith(".xml"):
            tree = ET.parse(folder +filename)
            root = tree.getroot()
            try:
                # root[-1] is the last element in the .xml file, which is <FullText>;
                # root[3] and root[4] are the two fields checked for the year
                if is_censorship(root[-1].text) and (is_year(root[3].text) or is_year(root[4].text)):
                    # add it to corpus as [full text, field 3, field 4]
                    files[filename] = []
                    files[filename].append(root[-1].text)
                    files[filename].append(root[3].text)
                    files[filename].append(root[4].text)
            except IndexError:
                # the file has fewer elements than expected, so skip it
                continue
#             i += 1 # for this version we only run 10000 iterations and break after th
#         if i == 10004:
#             break
#     if i == 10004:
#         break    
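
The loop above reads fields by position (`root[-1]`, `root[3]`, `root[4]`), which depends on every file having the same element order. As a hedged sketch, the same fields could be read by tag name with `Element.find`. Only the `<FullText>` tag is confirmed by the notebook; the `Record`, `RecordID`, `Title`, and `PubDate` tags, the sample document, and the `read_fields` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; only <FullText> is a tag name taken from the notebook.
SAMPLE = """<Record>
  <RecordID>1</RecordID>
  <Title>Example headline</Title>
  <PubDate>Jan 20, 1916</PubDate>
  <FullText>The censor held the mail.</FullText>
</Record>"""


def read_fields(root, full_text_tag="FullText", other_tags=("Title", "PubDate")):
    """Return (full_text, [other field texts]); missing or empty elements become ""."""
    def text_of(tag):
        node = root.find(tag)  # direct child with this tag, or None
        return node.text if node is not None and node.text else ""

    return text_of(full_text_tag), [text_of(tag) for tag in other_tags]


root = ET.fromstring(SAMPLE)
full_text, others = read_fields(root)
print(full_text)
print(others)
```

Returning an empty string for a missing or empty element means the later substring tests never receive `None`, so no `try`/`except` is needed around them.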

In [11]:
# i = 0 # Here we selectively add folders to our corpus
# folders = ["NYT/sm_55428_1004/","NYT/sm_55428_1005/"]
# for folder in folders:
#     for filename in os.listdir(folder):
#         if filename.endswith(".txt"):
#             files[filename] = []
#             try:
#                 if is_censorship(filename):
#                     file = open(filename, "r") 
#                     files[filename].append(file.read())
#                 # add it to corpus; '-1' signifies the last element in the .xml file, which is <FullText>
#             except IndexError:
#                 continue
#     i += 1 # for this version we only run 10000 iterations and break after th
#     if i == 10004:
#         break

In [12]:
# os.listdir("NYT/sm_55428_1004/") # to view lists

In [13]:
for file in files: # Here we can see the list of articles that are related to censorship and will be topic modeled
    print(file + "\t" + files[file][1] + "\t" + files[file][2])

sm_55428_1097-13873.xml	Jan 20, 1916	19160120
sm_55428_1097-15225.xml	Picture Board of Trade's Show in Madison Square Garden Reveals All the Secrets. BLACKTON HITS CENSORSHIP " The Greatest Peril That Menaces the Photo Play," President Says ;- Parade Gets Out of Focus.	May 7, 1916
sm_55428_1097-16951.xml	Everything Going Well, Pershing Notified Funston from the Field.	Mar 17, 1916
sm_55428_1097-18162.xml	But Neutral Conference Committee Refuses to Reveal Her Identity Now. TAKE LETTER TO WILSON President Not In When Delegation Headed by G.F. Peabody Calls at the White House.	Dec 24, 1916
sm_55428_1097-183.xml	Carranza's Attitude May Decide Peace or War in 48 Hours. DOUBLE DEALING CHARGED Our Army Will Stay, Lansing Says, but Our Sole Purpose Is Still to Guard Our Border. RECITES BORDER OFFENSES And Declares First Chief Fanned Popular Ill-Will ;- Washington Rushing All Preparations.	Jun 21, 1916
sm_55428_1097-18685.xml	Thinks He Has Set Forth Momentous Principles.	May 31, 1916
sm_55428_1

sm_55428_1099-13841.xml	Deputies Say His Activities Are "Absolutely Illegal."	Jan 23, 1916
sm_55428_1099-13861.xml	Drastic Amendments Adopted, Intended to Reach Britain and Her Allies. THREAT TO CLOSE THE MAILS Final Vote on the Measure, 42 to 16, Is Reached After 14 Hours of Debate. TARIFF BOARD IS CREATED News Print Paper Under 5 Cents a Pound to be Admitted Free as a Relief for Famine. REVENUE BILL PASSED BY SENATE	Sep 6, 1916
sm_55428_1099-14169.xml	Rev. Dr. Goodchild Charges Gen. Funston Objected to Revivals on the Border. TO COMPLAIN TO CONGRESS Will Ask for Investigation of Religious Condition in the U.S. Army.	Nov 20, 1916
sm_55428_1099-17213.xml	Oct 11, 1916	19161011
sm_55428_1099-17231.xml	Stir Resentment Here, Especially the Course in Ireland, Manchester Guardian Says.	Aug 19, 1916
sm_55428_1099-20077.xml	Oct 23, 1916	19161023
sm_55428_1099-20093.xml	Zinc Dross and Carburetors Among the Articles Passed On.	Oct 29, 1916
sm_55428_1099-20690.xml	Jan 21, 1916	19160121
sm_55428_1

sm_55428_1100-16924.xml	French Minister of Marine Asserts That Allies Strive to be Fair to Neutrals.	Sep 2, 1916
sm_55428_1100-17161.xml	Commander's Guiding Principle an Offensive Regardless of Cost. WATCHES ALL FRONTS Thirty Telegraph Instruments Keep Him Informed on Entire Field of War. HIS ARMY IN HIGH SPIRITS General Goes Personally Into Trenches to See That the Soldiers Lack Nothing. LINSINGEN TELLS OF KOVEL ATTACK	Jul 27, 1916
sm_55428_1100-17297.xml	Dec 10, 1916	19161210
sm_55428_1100-17938.xml	Lansing Sends to London a Protest of American Correspondents in Berlin.	Aug 11, 1916
sm_55428_1100-19270.xml	Mar 26, 1916	19160326
sm_55428_1100-20234.xml	Private Letters Reach London Embassy After Being Read by Censor.	Jan 11, 1916
sm_55428_1100-21043.xml	15,000 at Sheepshead Bay See Western Men and Women Ride Untamed Horses. COWGIRLS IN RELAY RACE Actor-Cowboy "Bulldogs" Wild Steer That Ripped Out Part of a Steel Fence.	Aug 6, 1916
sm_55428_1100-21725.xml	Nov 28, 1916	19161128
sm_55428_

sm_55428_1100-44845.xml	British Say U-Boat That Sank Liner Was Destroyed with Crew Soon Afterward. AND DYING MEN CONFESSED Insist Commander Schneider, Said to Have Been Rebuked, Never Returned to Port. CHARGE UNTRUTH IN ARABIC NOTE	Mar 2, 1916
sm_55428_1100-45441.xml	Government Permission, However, Extends Only as Far as Border.	Jun 25, 1916
sm_55428_1100-46632.xml	Teuton Correspondent's Estimate of Total for the Summer Campaign.	Aug 12, 1916
sm_55428_1100-48066.xml	Must Do So to Get American Supplies Now Held In England.	Apr 29, 1916
sm_55428_1100-48097.xml	Trying to Meet American Complaints While Preparing Reply to Note, Says Bunsen. WILL HELP INQUIRERS Cecil Indicates Allies Will Move to End "Misconceptions" as to the Blockade.	Jun 17, 1916
sm_55428_1100-48239.xml	British Restriction for Britons and Canadians In Germany.	Nov 2, 1916
sm_55428_1100-48660.xml	Army Officers Restlessly Await Right to Use Railroads Into Mexico.	Mar 28, 1916
sm_55428_1100-48835.xml	Oct 14, 1916	19161014
sm

sm_55428_1101-3528.xml	Newspapers Renew Charge That He Assumes an Unfriendly Policy Toward Germany. BASED ON NEWS REPORT President Put in Light of Bluffing Germany and Changing His Course Since Election.	Dec 3, 1916
sm_55428_1101-35463.xml	Censorship Went Through Its Delaying Process in This as in Everything Else. TIMES GAVE CANADA TEXT All That Lloyd George Said, with Hall Caine's Description of Scene, Sent to Montreal from New York.	Dec 21, 1916
sm_55428_1101-35815.xml	Nov 1, 1916	19161101
sm_55428_1101-36150.xml	Washington Professes No Alarm Over Submarines, but Is Mystified by Their Acts. ANNOYED BY BERLIN HINT Lansing Indicates That Alarmist Rumors Are Being Created In This Country.	Nov 25, 1916
sm_55428_1101-36916.xml	Great Britain and France Make Reply to Our Protest of Interruption. CONTRABAND IS OBTAINED Rubber, Metal, Sausages, and Odd Lot of Other Merchandise Found on Ships.	Apr 4, 1916
sm_55428_1101-37954.xml	Poilus Have Learned Lessons of Modern Warfare Allies Have Neglect

sm_55428_1102-25875.xml	Nation, Cut Off from the World, Is Kept in Ignorance by Its German Masters. FUNDS LOW, FOOD SCARCE Defensive Forces, Including 60,000 Teutons, Estimated at Less Than 400,000 Men.	Aug 25, 1916
sm_55428_1102-26070.xml	Professor George Trumbull Ladd Says That Nation's School Teachers Should Have Greater Freedom and Improved Culture	Aug 6, 1916
sm_55428_1102-2610.xml	The Season's Contributions to Literature Indicated by a Carefully Selected List of Representative Publications SOME AUTHORS PROMINENT IN THIS SEASON'S BOOKS THREE HUNDRED LEADING SPRING BOOKS THREE HUNDRED LEADING SPRING BOOKS THREE HUNDRED LEADING SPRING BOOKS THREE HUNDRED LEADING SPRING BOOKS THREE HUNDRED LEADING SPRING BOOKS THREE HUNDRED LEADING SPRING BOOKS THREE HUNDRED LEADING SPRING BOOKS	Apr 16, 1916
sm_55428_1102-26397.xml	Lord Robert Cecil Tells Foreign Correspondents His Idea on News Curbs	Feb 19, 1916
sm_55428_1102-26880.xml	Tells Bankers Here It Operates as a "White List" of American Fir

sm_55428_1102-5430.xml	Declares Suffragists, with Endless Chain Postals, Are Repeating Liquor Attacks. ANTIS' HEAD INDIGNANT Denounces "Cowardly Scheme" and Insists That Liquor Interests Never Aided the Organization.	Oct 30, 1916
sm_55428_1102-5543.xml	Apr 27, 1916	19160427
sm_55428_1102-5659.xml	Royal Governor General Has Shown High Ability, in Dominion's Opinion. FIRST AND LAST A SOLDIER Though Hedged About with Idea of Caste, He Showed Democratic Spirit.	Sep 17, 1916
sm_55428_1102-9381.xml	Two Cars Roll Down Embankment in Mexico -- Aviator Bowen Hurt in Fall.	Mar 26, 1916
sm_55428_1102-9840.xml	If Germany Elects to Make Her Main Effort in the Eastern Field Allies Will Have to Meet and Seek to Overcome German Armies There.	Dec 10, 1916
sm_55428_1103-1000.xml	M. Neratoff, Former Assistant to Stuermer, Becomes the Chief.	Nov 26, 1916
sm_55428_1103-10102.xml	Mexican Diplomat Repudiates Statement About American Aid of Villa. FRICTION STILL POSSIBLE Carranza's Reported Criticism of United

sm_55428_1103-29391.xml	Rigorous Censorship on Reports of What Is Happening There.	May 31, 1916
sm_55428_1103-3026.xml	President's Protest and Reassuring Reports Expected to Lessen Republican Efforts. MEXICAN CAMPS ON VIEW Gen. Calles Invites American Inspection to Show Nothing Hostile Is Being Done. MAY NOT URGE BIG FORCE ON BORDER	Mar 27, 1916
sm_55428_1103-30321.xml	Hermann Fernau, in New Book, Flays Censorship, Ridicules Intellectuals, and Demands Punishment for Instigators of the War	May 14, 1916
sm_55428_1103-3038.xml	Dec 21, 1916	19161221
sm_55428_1103-30891.xml	Vessels Heavily Armed and Disguised Sent Over in Anticipation of New U-Boat Raid.	Dec 27, 1916
sm_55428_1103-31275.xml	Jun 8, 1916	19160608
sm_55428_1103-31280.xml	Point Raised That U-Boat Attacks Constitute a Virtual Blockade.	Oct 9, 1916
sm_55428_1103-31683.xml	Some of the Interesting Reminiscences of a Veteran in the Jewelry Trade.	Sep 17, 1916
sm_55428_1103-3169.xml	Sep 29, 1916	19160929
sm_55428_1103-32183.xml	Signe

sm_55428_1103-8203.xml	Proclamation of Martial Law Disavows Intention to Destroy Republic's Independence.	Dec 2, 1916
sm_55428_1103-8898.xml	Tent City Overshadows Adobe and Pine Buildings in the Little Border Town. PUBLIC BATHTUB IN DEMAND Negro Says "Columbus Discovered America, but It Took Villa to Discover Columbus."	Apr 2, 1916
sm_55428_1103-8983.xml	Thought It Better to Aid Soldiers' Families Than to Give the Fighters More Money.	Jan 21, 1916
sm_55428_1103-9332.xml	Edison's Man Finds Mysterious Mixture Put in Water Was Acetone.	May 29, 1916
sm_55428_1104-10393.xml	May 4, 1916	19160504
sm_55428_1104-10582.xml	Reveal in War Department Statement How Newspapers Have Affected Campaigns. JAPAN'S POLICY OUTLINED Say Sherman's March to the Sea Was Due to Information Imparted by Southern Journals.	Jul 7, 1916
sm_55428_1104-10811.xml	May 27, 1916	19160527
sm_55428_1104-11086.xml	Jun 21, 1916	19160621
sm_55428_1104-11271.xml	Jan 6, 1916	19160106
sm_55428_1104-11428.xml	Sep 28, 1916	19160928


sm_55428_1104-34847.xml	Wants Our Citizens to Return as Quickly as Practicable. CRISIS WAITS ON CARRANZA Final Issue Probably Will Not Come Up Before House Meets Again on Wednesday. VILLA WITH CARRANZISTAS? Funston Sends Rumor That the Bandit Is at Bustillos ;- Increase in Recruiting. LANSING WANTS AMERICANS TO LEAVE	Jul 2, 1916
sm_55428_1104-35361.xml	Citations of Failure in American Neutrality, Always to Germany's Cost ;- Strength of the Hyphenate Vote Analyzed.	May 18, 1916
sm_55428_1104-35972.xml	Large Detachment Enters from Columbus -- Scouts with Troopers.	Mar 21, 1916
sm_55428_1104-37241.xml	Reinspection of 22 Places Shows 4 Advanced to Fair and 8 Making Progress. MAYOR WON'T CALL HALT Samuel L. Martin, His Secretary, Asserts There Will Be No Calling Off of Inquiry.	Jun 18, 1916
sm_55428_1104-37627.xml	Oct 23, 1916	19161023
sm_55428_1104-37797.xml	German Situation Subject of Night Council ;- President Stands on Terms of His Note. GERMANY'S REPLY PROMISED TODAY	May 4, 1916
sm_554

sm_55428_1104-673.xml	Army Officers Admit It Has Been Farcical and Favor Revision of Rules. STRATEGY SET AT NAUGHT War College Cites History and Favors Penalties ;- Plight of a Sagebrush Journalist.	Jun 4, 1916
sm_55428_1104-722.xml	Veteran Moro Fighter to Take Up Actual Pursuit of Villa. FREE REIN FOR FUNSTON President Gives Him Full Power to Deal with Whole Chihuahua Situation. ALL MOVEMENTS SECRET War Department Observes the General's Advice -- More Dead Mexicans Found.	Mar 12, 1916
sm_55428_1104-7816.xml	Censorship Remains in Force to Prevent Comment on the War.	Jul 22, 1916
sm_55428_1104-8426.xml	Oct 11, 1916	19161011
sm_55428_1104-9317.xml	Trying to Meet American Complaints While Preparing Reply to Note, Says Bunsen. WILL HELP INQUIRERS Cecil Indicates Allies Will Move to End "Misconceptions" as to the Blockade.	Jun 17, 1916
sm_55428_1104-9374.xml	ART AT HOME AND ABROAD	Apr 23, 1916
sm_55428_1104-9524.xml	Leaves Columbus, Bound for Big Bend District of Texas.	May 21, 1916
sm_5542

sm_55428_1105-28984.xml	May 28, 1916	19160528
sm_55428_1105-29419.xml	Daughter of Their Commandant General Answers Charges.	Aug 30, 1916
sm_55428_1105-30096.xml	15,000 at Sheepshead Bay See Western Men and Women Ride Untamed Horses. COWGIRLS IN RELAY RACE Actor-Cowboy "Bulldogs" Wild Steer That Ripped Out Part of a Steel Fence.	Aug 6, 1916
sm_55428_1105-30665.xml	Apr 18, 1916	19160418
sm_55428_1105-30911.xml	New Yorker Hears from a Friend, Apparently by Courtesy of a Raider.	Apr 5, 1916
sm_55428_1105-32287.xml	Two Cars Roll Down Embankment in Mexico -- Aviator Bowen Hurt in Fall.	Mar 26, 1916
sm_55428_1105-32928.xml	Preparations Made in El Paso for an American Force to Occupy City. CARRANZA GARRISON LEAVES Troop and Supply Trains Pull Out of City, Now Dominated by Gen. Bell's Cannon. JUAREZ VACATED, ARMY MAY ENTER	Jun 23, 1916
sm_55428_1105-33473.xml	Prominent Statesmen See in It Outspoken Sympathy with Allied Cause. SOME EXPRESS SURPRISE President Believed to be Making Amends for Prev

sm_55428_1106-16300.xml	Jul 1, 1916	19160701
sm_55428_1106-16545.xml	Jul 22, 1916	19160722
sm_55428_1106-16666.xml	German Socialists Put Agent in Office of Vorwaerts, Party Organ.	Apr 9, 1916
sm_55428_1106-17069.xml	Enormous Sums of Money Circulating in These Piping Times of War. PEASANTRY PROSPEROUS But Employes In Cities Suffer from High Food Prices and Supplies Are Scarce.	Mar 19, 1916
sm_55428_1106-17439.xml	Jun 7, 1916	19160607
sm_55428_1106-17932.xml	Senate Passes Bill Authorizing Appointment by the Regents.	Apr 6, 1916
sm_55428_1106-180.xml	But Result of Battle Causes Dissatisfaction Pending Full Details from Jellicoe. ZEPPELINS' VALUE SEEN Their Scouting, it Is Assumed, Enabled Germans to Take Up Advantageous Position Secretly.	Jun 4, 1916
sm_55428_1106-18037.xml	" A Kiss for Cinderella," at the Empire, Is Barrieism Raised to the Nth Power. A LITTLE SLAVEY'S DREAM Maude Adams Is as Winsome as Ever in a Charming but Conspicuously Scanty Fancy.	Dec 26, 1916
sm_55428_1106-18141.xm

sm_55428_1106-4001.xml	British Regulars Seize Chief Centres of Dublin and Surround Disturbed Area. 25 FILIBUSTERS CAUGHT British Patrol Activity, Says One Account, Forced Crew to Sink Arms-Laden Steamer. CASEMENT FOUND IN BOAT Another Report Says the Police Caught the ex-Consul With Two Aides After Reaching Shore. MARTIAL LAW CURBS IRELAND	Apr 27, 1916
sm_55428_1106-40264.xml	American Border Towns Get Reports of Serious Outbreaks in Several Commands. COLONEL ROJAS WAS SLAIN Presidio Arms as Result of Rioting Across the Border -- Douglas Has Alarming Rumors.	Mar 16, 1916
sm_55428_1106-40805.xml	Ferdinand of Rumania Says It Must Not Give Way.	Oct 27, 1916
sm_55428_1106-41507.xml	Nov 21, 1916	19161121
sm_55428_1106-42131.xml	Miss Carolyn Beebe's Organization Appears In Aeolian Hall.	Oct 25, 1916
sm_55428_1106-42357.xml	Passengers Arriving Here Say the Liner's Captor Was a Converted Fruiter. SUSPECT SHE SOWED MINES Prisoners Formed a Conspiracy to Overcome Germans, but Feared for Women. ON

sm_55428_1109-22845.xml	Miss Lyons, Who Was to Marry William Ducker, Says His Family Forcibly Detained Him. WEDDING GUESTS GATHERED But Prospective Bridegroom Failed to Appear -- Chance Remark Broke Up Plane.	Jan 31, 1916
sm_55428_1109-2328.xml	German Authorities Unable to Find Source of Little Newspaper That Annoys Them. OFFER $10,000 REWARD Editors Counsel the Belgians to be Patient Under Their Wrongs and Await Day of Vengeance.	May 21, 1916
sm_55428_1109-23845.xml	Little Paper Published at the Front by Canadian Scottish a Human Document. DAILY TOLL OF CARNAGE But Stout-Hearted 'Johnny Canuck,' Despite the Nerve-Racking Strain, Is Cheerful Through It All.	May 5, 1916
sm_55428_1109-24116.xml	Mar 19, 1916	19160319
sm_55428_1109-24231.xml	Sheriff at El Paso Wants Militia Called -- Federal Agents Irritated by Canards.	Mar 25, 1916
sm_55428_1109-24450.xml	Jun 2, 1916	19160602
sm_55428_1109-25109.xml	British Issue New List of Vessels Taken Into Kirkwall.	Apr 23, 1916
sm_55428_1109-25680.xm

sm_55428_1109-6808.xml	London View of Air Raids, Jutland Fight, and Somme Prospects.	Oct 29, 1916
sm_55428_1109-6989.xml	German Embassy Puts Letters Aboard the Deutschland, Which Is Fully Laden. CARGO EXCEEDS 1,000 TONS Rubber and Nickel Are Stowed in Her Hold and Oil Taken Aboard Before Trial Dip. METAL AND ACID ON BREMEN Sister Ship, Baltimore Expects, Will Come in a Few Days, and After Her a Zeppelin. SUBMARINE MAY GO TO SEA SOON	Jul 19, 1916
sm_55428_1109-7004.xml	Socialist Declares Bethmann Holweg Made a Positive Statement on the Subject.	Jun 22, 1916
sm_55428_1109-7669.xml	British Officials Say Prisoner Never Was a Spy, but Is a Common Forger. TRADUCED GERMAN PEOPLE His Pamphlet In Darlington Election Asserted Some of the Poor Ate Dog Meat.	Feb 22, 1916
sm_55428_1109-8249.xml	Woe Awaits the Government That Shall Have Failed to Win the War and Cannot Justify Itself for Starting It.	May 31, 1916
sm_55428_1109-8309.xml	Described Progress of Revolt and Contained "Army Orders."	May 1,

sm_55428_1111-15580.xml	General's Column Goes from Columbus, Other from Near Hachita. ORDERS TO WIPE OUT VILLA Carranza's Generals Obey His Command to Co-operate with the Bandit's Pursuers. SECRECY VEILS OPERATIONS Number of American Troops Is Withheld -- Aeroplanes and Radio Will Aid Them.	Mar 16, 1916
sm_55428_1111-15891.xml	Six Tunnel Out of a Detention Camp at Alberta.	Apr 30, 1916
sm_55428_1111-16709.xml	Found Unconscious from Gas Poisoning in Paris Studio.	Jan 11, 1916
sm_55428_1111-17547.xml	Aeroplanes Defending City Rise Only When Raiders Are at Hand, Whereas Frenchmen Are Constantly on Guard High Above Paris	Feb 6, 1916
sm_55428_1111-19047.xml	Tells House Committee He Will Add Materially to It with Men and Machines. TO DROP ALL DEADWOOD Men Doing Well in Mexico with Under-Powered Machines -- Trouble with Wireless, Too.	Apr 9, 1916
sm_55428_1111-22454.xml	Magyar Deputy Says the Officers Insult, Beat and Starve Their Soldiers.	Jan 31, 1916
sm_55428_1111-2402.xml	Artillery Used A

sm_55428_1113-15081.xml	Assails Pro-Germans, Pacifists, and Embargo Advocates in Speech in House. CLASHES WITH LONGWORTH Two Republican Members from Wisconsin Also Accuse Him of Slandering Their Constituents.	Jan 8, 1916
sm_55428_1113-15270.xml	Press Emphasizes the Idea That We Have Made Our Terms More Drastic. BUT ARGUES AGAINST WAR Some Newspapers See Peril for the Empire in an Armed Clash with Us.	Feb 6, 1916
sm_55428_1113-17200.xml	Major Sample Threatens Legal Action Against Offenders.	Mar 19, 1916
sm_55428_1113-17356.xml	He Is Now Loved and Venerated by Those Who Once Hated and Feared Him. THE EMPIRE'S DRIVING FORCE Made So by His Passionate Belief in the Heroic Potentialities of His People. VICTORY HIS SOLE AIM Labor's Ardent Partisan, He Ruthlessly Scraps His Old Opinions In His Munitions Campaign.	Jan 30, 1916
sm_55428_1113-26335.xml	Charles Urban Tells of a Cruise in the North Sea with a Big British Squadron. ICELAND TO HELIGOLAND How a Picture Showing a Bit of Land Caused Jel

In [14]:
len(files)

988

In [15]:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

In [16]:
years = {}

In [17]:
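# Note: the month test is a substring match, so headline words such as "May" or "Martial"
# also match; for those files the value stored in `year` is a headline fragment, not a year
# (see the non-1916 entries in the output below).
for file in files: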
    for month in months:
        try:
            if month in files[file][1]:
                year = files[file][1].split(", ")[1]
                break
            if month in files[file][2]:
                year = files[file][2].split(", ")[1]
                break
        except IndexError:
            continue
    if year in years:
        years[year] += 1
    else:
        years[year] = 1
    print(file + "\t" + year + "\n")

sm_55428_1097-13873.xml	1916

sm_55428_1097-15225.xml	1916

sm_55428_1097-16951.xml	1916

sm_55428_1097-18162.xml	1916

sm_55428_1097-183.xml	Lansing Says

sm_55428_1097-18685.xml	1916

sm_55428_1097-2152.xml	1916

sm_55428_1097-21846.xml	1916

sm_55428_1097-23036.xml	1916

sm_55428_1097-24940.xml	1916

sm_55428_1097-26188.xml	1916

sm_55428_1097-28334.xml	1916

sm_55428_1097-31573.xml	1916

sm_55428_1097-35181.xml	1916

sm_55428_1097-3735.xml	1916

sm_55428_1097-39165.xml	1916

sm_55428_1097-41584.xml	1916

sm_55428_1097-44463.xml	1916

sm_55428_1097-45048.xml	1916

sm_55428_1097-48670.xml	1916

sm_55428_1097-48740.xml	1916

sm_55428_1097-5012.xml	1916

sm_55428_1097-8124.xml	1916

sm_55428_1097-9170.xml	1916

sm_55428_1098-10758.xml	1916

sm_55428_1098-1172.xml	1916

sm_55428_1098-13928.xml	1916

sm_55428_1098-14309.xml	1916

sm_55428_1098-16068.xml	1916

sm_55428_1098-16241.xml	1916

sm_55428_1098-16929.xml	1916

sm_55428_1098-18835.xml	1916

sm_55428_1098-18879.xml	1916

sm_55428_1


sm_55428_1102-20048.xml	1916

sm_55428_1102-20737.xml	1916

sm_55428_1102-21005.xml	1916

sm_55428_1102-21108.xml	1916

sm_55428_1102-21755.xml	1916

sm_55428_1102-21835.xml	1916

sm_55428_1102-22060.xml	1916

sm_55428_1102-235.xml	1916

sm_55428_1102-23665.xml	1916

sm_55428_1102-2408.xml	1916

sm_55428_1102-24257.xml	1916

sm_55428_1102-24789.xml	1916

sm_55428_1102-25627.xml	1916

sm_55428_1102-25875.xml	1916

sm_55428_1102-26070.xml	1916

sm_55428_1102-2610.xml	1916

sm_55428_1102-26397.xml	1916

sm_55428_1102-26880.xml	1916

sm_55428_1102-26934.xml	1916

sm_55428_1102-27064.xml	1916

sm_55428_1102-27429.xml	1916

sm_55428_1102-27891.xml	1916

sm_55428_1102-27910.xml	1916

sm_55428_1102-2825.xml	1916

sm_55428_1102-28948.xml	1916

sm_55428_1102-29411.xml	1916

sm_55428_1102-29421.xml	1916

sm_55428_1102-29873.xml	1916

sm_55428_1102-30381.xml	1916

sm_55428_1102-30478.xml	1916

sm_55428_1102-30490.xml	1916

sm_55428_1102-31268.xml	1916

sm_55428_1102-31416.xml	1916

sm_55428_1102-


sm_55428_1104-4255.xml	He Predicted Clash with Germany

sm_55428_1104-42958.xml	1916

sm_55428_1104-43212.xml	1916

sm_55428_1104-43219.xml	1916

sm_55428_1104-43265.xml	1916

sm_55428_1104-43823.xml	Purporting to be from a Member of House of Commons

sm_55428_1104-4427.xml	1916

sm_55428_1104-44509.xml	1916

sm_55428_1104-44628.xml	1916

sm_55428_1104-44828.xml	1916

sm_55428_1104-44830.xml	1916

sm_55428_1104-4500.xml	Now in Men's Jobs

sm_55428_1104-46049.xml	1916

sm_55428_1104-46702.xml	1916

sm_55428_1104-47779.xml	1916

sm_55428_1104-47823.xml	1916

sm_55428_1104-47907.xml	1916

sm_55428_1104-48648.xml	1916

sm_55428_1104-48856.xml	1916

sm_55428_1104-48942.xml	1916

sm_55428_1104-49030.xml	1916

sm_55428_1104-49110.xml	1916

sm_55428_1104-49877.xml	1916

sm_55428_1104-5176.xml	1916

sm_55428_1104-5413.xml	1916

sm_55428_1104-5512.xml	1916

sm_55428_1104-5537.xml	1916

sm_55428_1104-5776.xml	1916

sm_55428_1104-6084.xml	1916

sm_55428_1104-6171.xml	1916

sm_55428_1104-673.xml	1


sm_55428_1109-2244.xml	1916

sm_55428_1109-22570.xml	1916

sm_55428_1109-22845.xml	1916

sm_55428_1109-2328.xml	1916

sm_55428_1109-23845.xml	1916

sm_55428_1109-24116.xml	1916

sm_55428_1109-24231.xml	1916

sm_55428_1109-24450.xml	1916

sm_55428_1109-25109.xml	1916

sm_55428_1109-25680.xml	1916

sm_55428_1109-2573.xml	1916

sm_55428_1109-25831.xml	1916

sm_55428_1109-27110.xml	1916

sm_55428_1109-28044.xml	1916

sm_55428_1109-28144.xml	1916

sm_55428_1109-28379.xml	1916

sm_55428_1109-28550.xml	1916

sm_55428_1109-29370.xml	1916

sm_55428_1109-29738.xml	1916

sm_55428_1109-30206.xml	1916

sm_55428_1109-31837.xml	1916

sm_55428_1109-31963.xml	1916

sm_55428_1109-32985.xml	1916

sm_55428_1109-33131.xml	1916

sm_55428_1109-33186.xml	1916

sm_55428_1109-33601.xml	1916

sm_55428_1109-34118.xml	1916

sm_55428_1109-34400.xml	1916

sm_55428_1109-35079.xml	1916

sm_55428_1109-3537.xml	1916

sm_55428_1109-36351.xml	1916

sm_55428_1109-36462.xml	1916

sm_55428_1109-37085.xml	1916

sm_55428_1109
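
Some rows above show a headline fragment such as `Lansing Says` instead of a year. This happens because a month abbreviation like "May" also occurs inside headlines, and the text after the first `", "` is then taken as the year. A minimal sketch of a stricter approach, assuming dates always look like `Jan 20, 1916` as in the printed list, matches the whole date with a regular expression. The `DATE_RE` pattern, the `extract_year` helper, and the abridged `sample_files` entries are illustrative, not part of the original notebook.

```python
import re
from collections import Counter

# Matches dates such as "Jan 20, 1916" and captures the four-digit year.
DATE_RE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, (\d{4})\b"
)


def extract_year(*fields):
    """Return the year of the first full date found in the fields, else None."""
    for field in fields:
        if not field:
            continue
        match = DATE_RE.search(field)
        if match:
            return match.group(1)
    return None


# Abridged entries in the same [full text, field 3, field 4] layout as `files`.
sample_files = {
    "a.xml": ["full text", "Jan 20, 1916", "19160120"],
    "b.xml": [
        "full text",
        "Carranza's Attitude May Decide Peace or War in 48 Hours. "
        "Our Army Will Stay, Lansing Says, but Our Sole Purpose Is Still to Guard Our Border.",
        "Jun 21, 1916",
    ],
}

year_counts = Counter(
    extract_year(entry[1], entry[2]) for entry in sample_files.values()
)
print(year_counts)
```

Because the pattern requires a day number and a four-digit year after the month, "May Decide" no longer matches, and files with no recognisable date are counted under `None` rather than under a stale or garbled value.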

In [18]:
# for line in open("1004-1119.txt"):
#     try:
#         name, date1, date2 = line.split("\t")
#     except ValueError:
#         print(line)
#     for month in months:
#         if month in date1:
#             try:
#                 if date1.split(", ")[1].isdigit():
#                     year = date1.split(", ")[1]
#             except IndexError:
#                 continue
#             break
#         if month in date2:
#             try:
#                 if date2.split(", ")[1].isdigit():
#                     year = date2.split(", ")[1]
#             except IndexError:
#                 continue
#             break
#     if int(year) in years:
#         years[int(year)] += 1
#     else:
#         years[int(year)] = 1
        
#     print(name + "\t" + year)

In [19]:
years

{'1916': 956,
 'Lansing Says': 1,
 "Now in Men's Jobs": 2,
 'It Is Said.': 2,
 'Purporting to be from a Member of House of Commons': 2,
 'Given by Woman Who Had No Money': 2,
 'Which Are Imposed by the Society of Surveillance': 1,
 'Although a Little Delay Will Help Finish Plans. NOT THE TIME TO MEDIATE Counselor Polk Declines to Discuss Matter with a Lawyer Employed by Carranza.': 1,
 'His Secretary': 2,
 'Urging a Dictatorship. CONTROL FOR ALL ALLIES Senator Benoist Says the Entente Needs a Government of Governments.': 2,
 'Sister of an Irish Baron': 2,
 'and Says in Art There Can Be Nothing New That Is Not Ugly': 1,
 'Sent in Separate Marked Bags': 2,
 'So Even Page Margins Are to be Cut. ADVERTISING ON INCREASE Members of Association to Fight Houston Plan for Free Space in the Preparedness Campaign.': 2,
 'He Predicted Clash with Germany': 1,
 'SAYS UNION Schlesinger Asserts Protective Association Is Behind Independent Body Urging Mediation.': 2,
 'August Belmont Tells Thompson Com

In [20]:
# sorted keys
for key, value in sorted(years.items(), key=lambda x: x[0]): 
    print("{} : {}".format(key, value))

1916 : 956
1917 : 1
Although a Little Delay Will Help Finish Plans. NOT THE TIME TO MEDIATE Counselor Polk Declines to Discuss Matter with a Lawyer Employed by Carranza. : 1
August Belmont Tells Thompson Committee. CENTRAL PLANS ATTACKED Charge That Estimate Board Would "Sell Out" City Ordered Sent to Mayor. : 2
Given by Woman Who Had No Money : 2
He Predicted Clash with Germany : 1
His Secretary : 2
It Is Said. : 2
Lansing Says : 1
N.M. : 2
Now in Men's Jobs : 2
Purporting to be from a Member of House of Commons : 2
SAYS UNION Schlesinger Asserts Protective Association Is Behind Independent Body Urging Mediation. : 2
Sent in Separate Marked Bags : 2
Sister of an Irish Baron : 2
So Even Page Margins Are to be Cut. ADVERTISING ON INCREASE Members of Association to Fight Houston Plan for Free Space in the Preparedness Campaign. : 2
Stunned by Blow : 2
Urging a Dictatorship. CONTROL FOR ALL ALLIES Senator Benoist Says the Entente Needs a Government of Governments. : 2
Which Are Imposed by

In [21]:
# sorted values
for key, value in sorted(years.items(), key=lambda x: x[1]): 
    print("{} : {}".format(key, value))

Lansing Says : 1
Which Are Imposed by the Society of Surveillance : 1
Although a Little Delay Will Help Finish Plans. NOT THE TIME TO MEDIATE Counselor Polk Declines to Discuss Matter with a Lawyer Employed by Carranza. : 1
and Says in Art There Can Be Nothing New That Is Not Ugly : 1
He Predicted Clash with Germany : 1
1917 : 1
Now in Men's Jobs : 2
It Is Said. : 2
Purporting to be from a Member of House of Commons : 2
Given by Woman Who Had No Money : 2
His Secretary : 2
Urging a Dictatorship. CONTROL FOR ALL ALLIES Senator Benoist Says the Entente Needs a Government of Governments. : 2
Sister of an Irish Baron : 2
Sent in Separate Marked Bags : 2
So Even Page Margins Are to be Cut. ADVERTISING ON INCREASE Members of Association to Fight Houston Plan for Free Space in the Preparedness Campaign. : 2
SAYS UNION Schlesinger Asserts Protective Association Is Behind Independent Body Urging Mediation. : 2
August Belmont Tells Thompson Committee. CENTRAL PLANS ATTACKED Charge That Estimate 

In [22]:
t=0
f=0

for file in files:
    if is_censorship(files[file][0]):
        
        t+=1
    else:
        f+=1
print("true = ", t, "false = ", f)

true =  988 false =  0


In [23]:
# is_censorship(files[""])

In [38]:
len(files)

988

In [25]:
# the "en" shortcut works on spaCy 2.x; on spaCy 3+ load a named model such as "en_core_web_sm"
nlp = spacy.load("en")

In [26]:
# we tokenize each article and keep only the lemmas of meaningful tokens
# substrings that disqualify a token: .xml punctuation remnants and common OCR misreadings
bad_substrings = ['&', ';', '$', ' ', "apos", "quot", "nos", "tlhe", "tlie", "tihe", "thie", "andl", "tile", "tho"]
for file in files:
    article = []
    doc = nlp(files[file][0].lower())
    for w in doc:
        # if it's not a stop word, punctuation mark or number, is longer than two characters,
        # and contains none of the bad substrings, add it to our article!
        if (not w.is_stop and not w.is_punct and not w.like_num and w.text != 'I'
                and len(w.text) > 2
                and not any(s in w.text for s in bad_substrings)):
            # we add the lemmatized version of the word
            article.append(w.lemma_)
    files[file].append(article)

In [27]:
# stop word list changelog
# 12/2: added "apos" and "quot" as undesirable remnants of .xml punctuation codings

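Now that we have our raw corpus, let us prepare it for modelling.

## Clean Corpus

The corpus is cleaned by removing stop words, punctuation, numbers, and whitespace, and by lemmatizing each remaining token; the loop above has already done this for every article. The cells below collect the cleaned token lists, join frequent word pairs into bigrams, and build the dictionary and bag-of-words corpus that the topic model needs.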

In [28]:
# iterate through corpus, clean code

In [29]:
from gensim.corpora import Dictionary
import gensim
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel

In [30]:
cleaned_texts = []

In [31]:
for file in files:
    cleaned_texts.append(files[file][3])

In [32]:
bigram = gensim.models.Phrases(cleaned_texts)

In [33]:
texts = [bigram[line] for line in cleaned_texts]



In [34]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
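
To make the dictionary and bag-of-words step concrete, here is a plain-Python illustration of the idea: each distinct token gets an integer id, and each document becomes a sorted list of `(token_id, count)` pairs. This is a sketch of the concept, not gensim's implementation. The helpers `build_vocab` and `to_bow` and the toy documents are hypothetical, and the ids gensim assigns may differ.

```python
from collections import Counter


def build_vocab(token_lists):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs; tokens missing from vocab are ignored."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())


# Toy documents standing in for the cleaned, bigrammed `texts`.
toy_texts = [["censor", "mail", "censor"], ["mail", "war"]]
toy_vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(tokens, toy_vocab) for tokens in toy_texts]
print(toy_vocab)
print(toy_corpus)
```

Word order is discarded in this representation; only how often each token occurs in each document is passed on to the topic model.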

In [41]:
ldamodel = LdaModel(corpus=corpus, num_topics=7, id2word=dictionary, passes=10)

In [42]:
ldamodel.show_topics()

[(0,
  '0.009*"german" + 0.006*"war" + 0.005*"say" + 0.004*"time" + 0.003*"great" + 0.003*"man" + 0.003*"censor" + 0.003*"ally" + 0.003*"french" + 0.003*"germany"'),
 (1,
  '0.006*"say" + 0.005*"government" + 0.004*"mail" + 0.004*"american" + 0.003*"state" + 0.003*"united_state" + 0.003*"british" + 0.003*"time" + 0.003*"country" + 0.003*"censor"'),
 (2,
  '0.008*"german" + 0.007*"war" + 0.007*"say" + 0.005*"germany" + 0.004*"time" + 0.004*"government" + 0.004*"american" + 0.003*"man" + 0.003*"come" + 0.003*"peace"'),
 (3,
  '0.007*"american" + 0.005*"say" + 0.005*"government" + 0.004*"united_state" + 0.004*"state" + 0.004*"villa" + 0.003*"force" + 0.003*"german" + 0.003*"man" + 0.003*"order"'),
 (4,
  '0.007*"say" + 0.006*"american" + 0.004*"war" + 0.004*"general" + 0.004*"report" + 0.003*"united_state" + 0.003*"man" + 0.003*"state" + 0.003*"government" + 0.003*"today"'),
 (5,
  '0.004*"german" + 0.004*"british" + 0.004*"time" + 0.003*"new" + 0.003*"man" + 0.003*"life" + 0.003*"war" + 

In [37]:
# list of stop words, following restriction of punctuation, spaces;
# apos quot nos tlhe tlie tihe thie andl tile tho
