Now that Brooks has done all the hard work putting together a subsample of the ICPSR data in a delimiter-separated-values file, we can import it into Python to start analyzing it! In this notebook we will see two different Python objects we can use for further text analysis techniques: the Pandas DataFrame and the scikit-learn Document Term Matrix.

The first is a Pandas DataFrame:

In [99]:
import pandas
#use the pandas read_csv function to read in the delimited file.
df = pandas.read_csv("2pipe/33501-0015-Data-1ksubsamp.w.meta.txt", sep="|") #separator value set to "|"

#see what the dataframe looks like
df

Unnamed: 0,speechID,speech,speakerID,date,num_order_within_file,file,line_start,line_end,page,char count,word count,state,name,chamber,district,party
0,10419950001379,In this particular CR?,104128,950119,34,Hcr19ja95-62 1.txt,169,170,H344-H345,60,13,IL,Cardiss Collins,H,7,D
1,10419950005157,"Mr. Speaker, I ask unanimous consent that all ...",104400,950202,31,Hcr02fe95-56 1.txt,685,727,H1078-H1142,2523,444,UT,Enid Greene Waldholtz,H,2,R
2,10419950005955,"Mr. Speaker, I thank the gentleman for yieldin...",104128,950203,257,Hcr03fe95-77 1.txt,3882,3882,H1168-H1191,39,7,IL,Cardiss Collins,H,7,D
3,10419950007261,"Pursuant to the order of the House of May 12, ...",104900,950209,212,Hcr09fe95-60 1.txt,3986,3989,H1472-H1526,65,11,,Speaker,,,
4,10419950008401,"Mr. Chairman, this amendment deals with striki...",104222,950210,4,Hcr10fe95-60 1.txt,72,153,H1604-H1608,5184,969,MS,Gene Taylor,H,5,D
5,10419950010486,"Mr. Speaker, if the gentleman will yield furth...",104275,950216,159,Hcr16fe95-87 1.txt,2512,2513,H1862-H1890,89,16,NY,Benjamin A. Gilman,H,20,R
6,10419950011475,"I was going to say, whether you are talking ab...",104121,950222,102,Hcr22fe95-85 1.txt,3603,3630,H2010-H2029,1784,310,ID,Michael D. Crapo,H,2,R
7,10419950012614,Objection is heard,104900,950227,1,Hcr27fe95-62 1.txt,17,18,H2260-H2261,80,15,,Speaker,,,
8,10419950013345,The gentleman from Colorado is recognized for ...,104374,950228,24,Hcr28fe95-76 1.txt,302,307,H2345-H2373,358,69,TX,Joe Barton,H,6,R
9,10419950015335,"Mr. Speaker, I move that the House do now adjo...",104045,950306,57,Hcr06mr95-63 1.txt,840,841,H2663-H2703,111,20,CA,Carlos J. Moorhead,H,27,R


Easy! We now have the familiar table-style dataframe, which we can slice, summarize, and analyze in any number of ways. For a quick-and-dirty analysis, let's count the words in each speech, then find the average number of words per speech, and finally the average number of words per speech for Democrats vs. Republicans.

Note: I'm not using the built-in "word count" variable because it looks completely wrong to me. It would be a good idea to check why this is the case.

In [103]:
###Side cleaning: do some quick cleaning to mark D as Democratic and R as Republican

#df['party'] = df['party'].map({"D": "Democratic", "R": "Republican"})
#print(df['party'])

df['party'] = df['party'].replace('D', 'Democratic')
df['party'] = df['party'].replace('R', 'Republican')

#Now, create a column called "word_count" with each speech's word count, using the split function
#splitting on single spaces is the fastest way to get a word count,
#but there are more accurate ways to do so, as we'll see below

df['word_count'] = df['speech'].apply(lambda x: len(x.split(' ')))
#check output, including built-in "word count" column which looks wrong to me 
df[['speech', 'word_count', 'word count']]

Unnamed: 0,speech,word_count,word count
0,In this particular CR?,4,13
1,"Mr. Speaker, I ask unanimous consent that all ...",25,444
2,"Mr. Speaker, I thank the gentleman for yieldin...",865,7
3,"Pursuant to the order of the House of May 12, ...",64,11
4,"Mr. Chairman, this amendment deals with striki...",382,969
5,"Mr. Speaker, if the gentleman will yield furth...",132,16
6,"I was going to say, whether you are talking ab...",450,310
7,Objection is heard,3,15
8,The gentleman from Colorado is recognized for ...,9,69
9,"Mr. Speaker, I move that the House do now adjo...",29,20


It passes the sniff test (see rows 0, 7, 994, and 999; why is "word count" so wrong?). Now we can calculate the total number of words, the average number of words per speech, and the average number of words per speech for Democrats versus Republicans.
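A quick aside on the whitespace split used for word_count: the sketch below, on two made-up speeches (not rows from the ICPSR file), shows how split(' ') and split() can disagree. The toy dataframe and its column names are assumptions for illustration only.

```python
import pandas as pd

# Toy speeches invented for illustration; the second one contains double spaces
toy = pd.DataFrame({"speech": ["Objection is heard", "Mr.  Speaker,  I yield."]})

# split(' ') breaks on every single space, so double spaces produce empty strings
toy["split_space"] = toy["speech"].apply(lambda x: len(x.split(" ")))

# split() with no argument breaks on any run of whitespace and drops empty strings
toy["split_any"] = toy["speech"].apply(lambda x: len(x.split()))

print(toy[["speech", "split_space", "split_any"]])
```

If the two columns differ much on the real data, stray spaces or line breaks inside the speeches are a likely cause.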

In [104]:
print("Total number of words in database:")
print(df['word_count'].sum())
print()
print("Average number of words per speech")
print(df['word_count'].mean())

#group by party affiliation
by_party = df.groupby("party")

print("________________________")
print("Average number of words per speech by Party Affiliation:\n")

print(by_party.word_count.mean().sort_values(ascending=False))

Total number of words in database:
200604

Average number of words per speech
200.604
________________________
Average number of words per speech by Party Affiliation:

party
Democratic     272.608696
Republican     223.231771
L               19.000000
Independent     14.000000
Name: word_count, dtype: float64


Great! 
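One caveat: a group average says little if the group holds only a handful of speeches, which may be the case for the smaller parties in the output above. A minimal sketch, on invented numbers, of reporting the number of speeches next to each mean:

```python
import pandas as pd

# Toy values invented for illustration, standing in for the party and word_count columns
toy = pd.DataFrame({
    "party": ["Democratic", "Democratic", "Republican", "Republican", "Independent"],
    "word_count": [100, 300, 150, 200, 14],
})

# agg() computes several summaries at once; "count" shows how many speeches back each mean
summary = (
    toy.groupby("party")["word_count"]
    .agg(["count", "mean"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Running the same agg call on df would show how many speeches sit behind each party's average.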

But this type of object is actually quite inefficient for large text-based data.

Another way of storing text-based data is the document-term matrix, which we can keep in sparse format (that is, without storing the elements that have a value of zero). In Python, we can build one using scikit-learn. The resulting object is a SciPy sparse matrix in Compressed Sparse Row (CSR) format, and it saves quite a bit of memory.
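To make the sparse idea concrete, here is a small sketch on a made-up 3-by-5 matrix (the values are invented, not taken from the speeches). It shows the three arrays a SciPy sparse row matrix keeps in place of the full grid of zeros.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny dense document-term matrix: 3 documents, 5 vocabulary terms (toy values)
dense = np.array([
    [1, 0, 0, 2, 0],
    [0, 0, 0, 0, 1],
    [0, 3, 0, 0, 0],
])

sparse = csr_matrix(dense)

# Only the nonzero counts are stored, along with two index arrays
print(sparse.data)     # the nonzero values, row by row
print(sparse.indices)  # the column (term) index of each nonzero value
print(sparse.indptr)   # where each row starts within data and indices
print(sparse.nnz, "stored values out of", dense.size, "cells")
```

Most speeches use only a small share of the vocabulary, so a real document-term matrix is mostly zeros and the stored arrays stay small relative to the full grid.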

In [105]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()

#fit the vectorizer we just created, so we can reuse it (and its vocabulary) later
sklearn_dtm = countvec.fit_transform(df.speech)

#take a look at the object

####
## The shape of the object shows the number of rows, or documents (first element)
## and the number of unique vocab (second element)
####

print(sklearn_dtm.shape)
####
## We can also see what the object looks like.
## The first element is the document ID, the second element is the vocab ID, and the third,
## outside the parentheses, is the number of times that word appears in that document
####
print(sklearn_dtm)

(1000, 11389)
  (0, 5474)	1
  (0, 10331)	1
  (0, 7493)	1
  (0, 2886)	1
  (1, 6851)	1
  (1, 9647)	1
  (1, 1253)	1
  (1, 10649)	1
  (1, 2630)	1
  (1, 10292)	1
  (1, 957)	1
  (1, 6613)	1
  (1, 6558)	1
  (1, 5064)	1
  (1, 6177)	1
  (1, 3053)	1
  (1, 11263)	1
  (1, 11181)	1
  (1, 10412)	1
  (1, 8866)	1
  (1, 1050)	1
  (1, 4219)	1
  (1, 10297)	1
  (1, 8653)	1
  (1, 7242)	1
  :	:
  (997, 10855)	1
  (997, 8729)	1
  (997, 4995)	1
  (997, 3755)	1
  (997, 3208)	1
  (997, 3592)	2
  (997, 972)	1
  (997, 5025)	2
  (997, 2299)	1
  (997, 10311)	1
  (997, 6397)	1
  (998, 1050)	1
  (998, 10293)	2
  (998, 7200)	1
  (998, 8276)	1
  (998, 11363)	1
  (998, 9998)	1
  (998, 4504)	1
  (998, 649)	1
  (999, 5828)	1
  (999, 11264)	1
  (999, 5846)	1
  (999, 9556)	1
  (999, 7144)	1
  (999, 7303)	1
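The vocab IDs in this printout are not very readable on their own. A minimal sketch of translating them back into words, using the fitted vectorizer's vocabulary_ dictionary on two made-up speeches (the names docs, vec, and id_to_term are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy speeches invented for illustration
docs = ["Objection is heard", "The gentleman is recognized"]

vec = CountVectorizer()
dtm = vec.fit_transform(docs)

# vocabulary_ maps each term to its column index; invert it to look up terms by ID
id_to_term = {idx: term for term, idx in vec.vocabulary_.items()}

# Walk through the stored (vocab ID, count) pairs of the first document
first_row = dtm[0]
words_in_first = {}
for col, count in zip(first_row.indices, first_row.data):
    words_in_first[id_to_term[col]] = int(count)

print(words_in_first)
```

The same lookup should work on the notebook's own matrix as long as the vectorizer that produced it is kept in a variable.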


Compare the (approximate) memory needed for each object, the Pandas dataframe (df) and the sklearn object (sklearn_dtm):

In [106]:
import sys
df_new = df[['speech']] #to make it comparable to the DTM, which is built from the text alone
print(sys.getsizeof(df_new))
print(sys.getsizeof(sklearn_dtm)) 

1233480
56


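The DTM looks much, much smaller! Be careful, though: for the sparse matrix, sys.getsizeof reports only the small Python wrapper object (the 56 bytes above), not the arrays that actually hold the counts, so this comparison overstates the savings.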
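For a fairer comparison, we can add up the arrays inside the sparse matrix ourselves. The sketch below uses a made-up dataframe rather than the real speeches, so the numbers it prints say nothing about the ICPSR data; the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy data invented for illustration: two short speeches repeated 50 times
toy = pd.DataFrame({"speech": ["Objection is heard", "The gentleman is recognized"] * 50})
dtm = CountVectorizer().fit_transform(toy["speech"])

# deep=True counts the contents of the strings, not just the pointers to them
df_bytes = toy["speech"].memory_usage(deep=True)

# A CSR matrix keeps its contents in three NumPy arrays; nbytes gives the size of each
dtm_bytes = dtm.data.nbytes + dtm.indices.nbytes + dtm.indptr.nbytes

print("speech column bytes:", df_bytes)
print("DTM array bytes:", dtm_bytes)
```

Keep in mind that the two objects do not hold the same information: the DTM stores counts only and drops word order, which is part of why it can be compact.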

We can also easily reproduce what we did above with the Pandas dataframe, using this dtm object, and can do so using a lot less memory.

In [107]:
import numpy as np
print("Total number of words in the dataframe:")
print(np.sum(sklearn_dtm, axis=0).sum())
print("Average number of words per speech:")
print((np.sum(sklearn_dtm, axis=0).sum())/sklearn_dtm.shape[0])

Total number of words in the dataframe:
194295
Average number of words per speech:
194.295


Note the slight difference in output. Why might this be the case? How could we figure out which speeches are causing the discrepancy? 
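One possible way to start on these questions is to compare the two counts speech by speech. With its default settings, CountVectorizer lowercases the text, treats punctuation as a separator, and keeps only tokens of two or more characters, while split(' ') keeps one-letter words and also counts the empty strings left by double spaces. A minimal sketch on two made-up speeches (the dtm_count and difference columns are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy speeches invented for illustration; the second has one-letter words and a double space
toy = pd.DataFrame({"speech": ["Objection is heard", "I move a CR  now"]})
toy["word_count"] = toy["speech"].apply(lambda x: len(x.split(" ")))

dtm = CountVectorizer().fit_transform(toy["speech"])

# Summing each row gives the tokens per document; flatten the (n_docs, 1) result
toy["dtm_count"] = np.asarray(dtm.sum(axis=1)).ravel()

# Large differences point to the speeches where the two tokenizers disagree most
toy["difference"] = toy["word_count"] - toy["dtm_count"]
print(toy.sort_values("difference", ascending=False))
```

Sorting the real dataframe by such a difference column would surface the speeches that contribute most to the gap between the two totals.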