Eva Bacas, <ENB30@pitt.edu>, January 13, 2019

+ Data set: The Blog Authorship Corpus
+ Authors/citation: J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). Effects of Age and Gender on Blogging. In Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.
+ URL: <http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm>
+ Makeup:
    + Collected posts of 19,320 bloggers gathered from blogger.com in August of 2004. 
    + Size: 681,288 posts, or 140 million+ words - which is ~35 posts/7250 words per blogger
    + Blogger info: ID#, self-provided gender, industry, and astrological sign
    + Categories: all blogs fall into three age groups, with an equal number of male and female bloggers in each category
        + "10s": 13-17, numbering 8240 bloggers
        + "20s": 23-27, numbering 8086 bloggers
        + "30s": 33-47, numbering 2994 bloggers
+ Language: English
+ Format: .xml
    + Each file is titled like this: IDnumber.gender.age.industry.astrological sign.xml, and contains all blog posts by that specific author.
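
A quick sketch of the makeup figures above, assuming the age ranges are exactly as listed. The helper name `age_group` is hypothetical, and ages that fall outside the three listed ranges simply return `None`.

```python
def age_group(age):
    """Return the corpus age-group label for an age, or None if it is outside the listed ranges."""
    if 13 <= age <= 17:
        return "10s"
    if 23 <= age <= 27:
        return "20s"
    if 33 <= age <= 47:
        return "30s"
    return None

# The three group sizes listed above should add up to the total number of bloggers.
group_sizes = {"10s": 8240, "20s": 8086, "30s": 2994}
total_bloggers = sum(group_sizes.values())
print(total_bloggers)  # 19320

# Back-of-the-envelope check of the posts-per-blogger figure.
posts_per_blogger = 681_288 / total_bloggers
print(round(posts_per_blogger, 1))  # 35.3
```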

Self-assessment:

I created lists containing the file names, the blog texts, the bloggers' genders, the bloggers' ages, the bloggers' occupations, and the bloggers' astrological signs. First I iterated through the blogs folder to grab all the file names, and then iterated through the file names to create the other five lists. I calculated the median and mean age of the bloggers (~23), and confirmed the total number of blog files (19,320). I found the average length of each blog file: ~41729. This is very different from the average number of words listed by the authors of the corpus - 7250. Part of the gap is that my figure counts the bytes in each raw file rather than words, and part is my inability to remove the XML formatting (more on that below).

I then created a Numpy structured array for the demographic data. After figuring out I couldn't really do much with the structured array, I used it to create a DataFrame using Pandas. I poked around in the DataFrame to investigate the relationship between age and occupation. The most common listed occupation (~5000 bloggers), "student", was also the youngest, with an average age of ~17. The occupations with the oldest mean age are "construction" and "consulting", both ~30. This discovery wasn't super surprising to me, since I expected the majority of the teen users would be students.

I wish I had been able to do more with the actual blog text. I tried removing the XML formatting using Beautiful Soup, but it didn't work so well, since XML is not HTML. Also, when I tried to run it on all the files, it crashed the kernel every time. I also tried to use the nltk ElementTree module, but I couldn't get it to work on these files. In the future, I would like to look at how the frequency of certain words varies across age and gender: it could be an interesting insight into slang/new words at the time. If there were a way to go through and tag blogs for topic, it could also be interesting to see how the themes of a person's blogs vary with age and gender.
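
As a rough idea of what the word-frequency comparison could look like, here is a minimal sketch on made-up data. The function name `word_counts_by_group`, the toy texts, and the group labels are all hypothetical, and the tokenization is just lowercasing and splitting on whitespace.

```python
from collections import Counter, defaultdict

def word_counts_by_group(texts, groups):
    """Tally word frequencies separately for each group label."""
    counts = defaultdict(Counter)
    for text, group in zip(texts, groups):
        # Naive tokenization: lowercase, then split on whitespace.
        counts[group].update(text.lower().split())
    return counts

# Made-up example texts, not taken from the corpus.
toy_texts = ["lol that was so cool", "the meeting ran long", "so cool lol"]
toy_groups = ["10s", "30s", "10s"]

counts = word_counts_by_group(toy_texts, toy_groups)
print(counts["10s"].most_common(3))
print(counts["30s"].most_common(3))
```

On the real corpus the texts would first need the XML stripped, and raw counts would need to be normalized by group size before comparing groups.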

In [1]:
import os
import numpy as np
import pandas as pd

1. I created six empty lists: one for the file names, one for the actual text contained in the files, and four for the demographics of the bloggers.

In [2]:
blog_names=[]
blog_txt=[]
blogger_gender=[]
blogger_age=[]
blogger_occupation=[]
blogger_sign=[]

2. I iterated through all the files in the folder 'blogs' and added the names to blog_names. 

In [3]:
for filename in os.listdir('blogs'):
    blog_names.append(filename)

3. I iterated through the filenames, read the files into blog_txt, and added the demographics into their respective lists. Since the demographic data for the bloggers was only contained in the filename, I split the filename on the dots to access the four demographic categories.

In [4]:
for filename in blog_names:
    # Read the raw bytes of each file; the text is not decoded here.
    with open(os.path.join('blogs', filename), 'rb') as f:
        blog_txt.append(f.read())
    # Filename pattern: IDnumber.gender.age.industry.sign.xml
    parts = filename.split('.')
    blogger_gender.append(parts[1])
    blogger_age.append(int(parts[2]))
    blogger_occupation.append(parts[3])
    blogger_sign.append(parts[4])
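
The filename splitting could also be wrapped in a small helper, which keeps the five fields together and fails loudly on a filename that doesn't fit the pattern. This is only a sketch: `BloggerInfo`, `parse_blog_filename`, and the example filename are made up for illustration.

```python
from typing import NamedTuple

class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    occupation: str
    sign: str

def parse_blog_filename(filename):
    """Split 'ID.gender.age.industry.sign.xml' into its demographic fields."""
    # Unpacking raises ValueError if the name does not have exactly six dot-separated parts.
    blogger_id, gender, age, occupation, sign, _ext = filename.split('.')
    return BloggerInfo(blogger_id, gender, int(age), occupation, sign)

# Hypothetical filename following the corpus naming pattern.
info = parse_blog_filename('1234567.female.25.Student.Leo.xml')
print(info.age, info.occupation)  # 25 Student
```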

Here are some short samples of the texts:

In [5]:
blog_txt[0][100:300]

b'thing     but i can hear     you have chosen me, your life partner     so have i dear,     so have i dear....      my first dream, my first extreme,     my first love, i was waiting for my DESTINY.   '

In [6]:
blog_txt[40][400:600]

b'have those), I like my body.  I have a little belly pudge, but I think just about the right amount.  If I wanted the perfect body, I would reduce the pudge by a little bit, but overall, I look good.  '

In [7]:
blog_txt[90][600:800]

b'mendous, constantly shifting earth.  sounds biblically hazardous. minnesota is from a dakota indian word meaning  sky-tinted water.  license plates there read "land of 10,000 lakes." there are actuall'

In [8]:
blog_txt[10000][900:1100]

b"love the rain (and I'm not ready for the blistering days of summer, thankyouverymuch), so I don't want it to go, but I also don't want it to come around so much.  On a whim, I tuned into an internet r"

4. I calculated the mean and median blogger age using Numpy: both around 23.

In [9]:
np.mean(blogger_age)

22.83379917184265

In [10]:
np.median(blogger_age)

23.0

5. Though I already know how many bloggers are included in this corpus, I checked by looking at the length of the blog_names list:

In [11]:
len(blog_names)

19320

6. Just to see how long an average blogger's blogs might be, I checked the length of the first blog in blog_txt. Then I used a for loop to figure out the average length of all the blog files: about 42,000. Because the files were read in as bytes, len() counts bytes rather than words, so this is really the average file size. That, together with the XML formatting I did not remove, is likely why it is so different from the average word count listed by the authors.

In [12]:
len(blog_txt[0])

7080

In [13]:
# blog_txt holds bytes objects, so len() counts bytes, not words.
total_bytes = 0
for blogger in blog_txt:
    total_bytes += len(blogger)
print(total_bytes / len(blog_txt))

41728.7147515528
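
To get closer to an actual word count, one option is to pull out only the text inside the post elements and count whitespace-separated tokens. This is a rough sketch, not something I ran on the corpus: it assumes each post is wrapped in `<post>...</post>` tags, it uses a regular expression instead of a real XML parser, and the sample bytes are made up.

```python
import re

# Assumption: every post body sits between <post> and </post>.
POST_RE = re.compile(rb'<post>(.*?)</post>', re.DOTALL)

def count_words(raw):
    """Count whitespace-separated tokens inside the <post> elements of a raw bytes file."""
    posts = POST_RE.findall(raw)
    return sum(len(post.split()) for post in posts)

# Made-up miniature file in the assumed layout.
sample = (b"<Blog>\n<date>01,May,2004</date>\n<post>\n  I love the rain.  \n</post>\n"
          b"<date>02,May,2004</date>\n<post>So have I, dear.</post>\n</Blog>")

print(len(sample))          # length in bytes, markup included
print(count_words(sample))  # 8
```

Averaging `count_words` over every entry in `blog_txt` would give a figure that is directly comparable to the ~7250 words per blogger reported by the corpus authors.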


7. After reading over the Numpy chapter of [The Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/), I really wanted to incorporate the Numpy structured array. So I made one, after struggling a lot, and then figured out I couldn't really do what I was hoping to do with it. I kept reading and learned about Pandas DataFrames, which sounded more useful and interesting. We also just did DataFrames with R in Ling Stats, and they were fun. So I gave that a go.

In [None]:
blogdata = np.zeros(len(blog_names), dtype={'names':('gender', 'age', 'occupation','sign'),
                              'formats':('U10', 'i4', 'U25','U25')})
blogdata['gender'] = blogger_gender
blogdata['age'] = blogger_age
blogdata['occupation'] = blogger_occupation
blogdata['sign'] = blogger_sign

In [16]:
blogstats = pd.DataFrame(blogdata)
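
For comparison, a DataFrame can also be built straight from the lists with a dictionary, skipping the structured array. A minimal sketch with made-up values standing in for the real lists (the names `toy_*` are hypothetical):

```python
import pandas as pd

# Made-up stand-ins for blogger_gender, blogger_age, blogger_occupation, blogger_sign.
toy_gender = ['female', 'male', 'female']
toy_age = [14, 25, 45]
toy_occupation = ['Student', 'Technology', 'indUnk']
toy_sign = ['Leo', 'Aries', 'Cancer']

# Each dictionary key becomes a column name; each list becomes that column's values.
toy_stats = pd.DataFrame({
    'gender': toy_gender,
    'age': toy_age,
    'occupation': toy_occupation,
    'sign': toy_sign,
})
print(toy_stats)
```

One difference from the structured array route is that there are no fixed-width string formats like 'U10' to pick, so long occupation names cannot get cut off.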

8. I decided the discovery I wanted to make was the relationship between blogger age and occupation. I figured, since a lot of the users were teens, their occupations were probably 'student'. First I used .loc to take a peek at the occupations of bloggers over 40:

In [19]:
blogstats.loc[blogstats.age > 40, ['age', 'occupation']]

     age  occupation
9     45  Technology
100   45      indUnk
117   41      indUnk
120   47      indUnk
121   45      indUnk
137   44        Arts
158   47      indUnk
163   43   Chemicals
165   41  Technology
183   45      indUnk


9. And then I looked at bloggers under 15 - mostly students, as I suspected. I wonder if "non-profit" is a joke about school being "working for free". 

In [20]:
blogstats.loc[blogstats.age < 15, ['age', 'occupation']]

    age  occupation
12   14  Non-Profit
18   14        Arts
23   14     Student
37   14     Student
43   14     Student
52   13      indUnk
73   14     Student
84   14     Student
86   13     Biotech
93   14     Student

10. I tried out .groupby, and found the mean age for each astrological sign. All fairly close to the overall mean, which makes sense since astrological sign and age are unrelated. 

In [23]:
blogstats.groupby('sign')['age'].mean()

sign
Aquarius       23.098372
Aries          22.994364
Cancer         23.648084
Capricorn      22.666667
Gemini         22.915361
Leo            23.293391
Libra          22.046471
Pisces         23.080380
Sagittarius    22.587476
Scorpio        22.301042
Taurus         22.903951
Virgo          22.496915
Name: age, dtype: float64

11. Then I tried making a pivot table, looking at mean age as a function of occupation and gender. Youngest overall occupation for both genders: student, at ~17. Looks like the oldest careers are construction, consulting, maritime (but just for women), and museums/libraries, all ~30. There are huge gender gaps for agriculture, biotech, and maritime, and in each of those women are the older group.

In [30]:
blogstats.pivot_table('age', index='occupation', columns='gender')

gender               female       male
occupation
Accounting        27.283784  24.645161
Advertising            27.2  27.528571
Agriculture           25.85       19.5
Architecture      25.571429  26.676471
Arts              25.076372  25.480132
Automotive        25.705882  25.675676
Banking           27.297872  26.338462
Biotech               28.55  21.405405
BusinessServices  26.382716  26.414634
Chemicals         20.565217   22.25641
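
Since a big gap in mean age can come from a cell with only a handful of bloggers, it may help to put the counts next to the means. A sketch on made-up rows, passing a list to `aggfunc` so the pivot table reports both (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Made-up rows; the real data has one row per blogger.
toy = pd.DataFrame({
    'gender': ['female', 'male', 'male', 'female', 'male', 'female'],
    'age': [30, 16, 18, 24, 26, 25],
    'occupation': ['Maritime', 'Maritime', 'Maritime', 'Arts', 'Arts', 'Arts'],
})

# A list of aggregation functions gives one block of columns per function.
summary = toy.pivot_table(values='age', index='occupation', columns='gender',
                          aggfunc=['mean', 'count'])
print(summary)
```

In this toy example the female mean for Maritime rests on a single blogger, which is the kind of thing the count columns make visible.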


12. Lastly I tried .describe() to summarize the distribution of ages for each occupation. I'm not feeling so great about how the minimum age for every occupation is 13-16, but hopefully it's just kids saying what they want to do, and not actual 13-year-olds working in the chemicals industry. This confirms that the oldest categories are construction and consulting. Here you can see the overall count for each occupation as well, with student being the largest group (~5000) aside from "unknown" (~6000). There are only 17 people working in maritime careers, the smallest category, which probably contributes to the huge gender gap in average age. Environment, investment banking (why differentiate this from banking?), and agriculture are also pretty small.

In [29]:
blogstats.groupby('occupation')['age'].describe()

occupation        count       mean       std   min   25%   50%   75%   max
Accounting        105.0  26.504762  6.922822  15.0  23.0  25.0  27.0  47.0
Advertising       145.0  27.358621  6.443856  13.0  24.0  25.0  27.0  47.0
Agriculture        36.0  23.027778   8.68985  14.0  16.0  23.0  26.0  47.0
Architecture       69.0  26.115942  6.846073  14.0  23.0  25.0  27.0  48.0
Arts              721.0  25.245492  8.247015  13.0  17.0  25.0  27.0  48.0
Automotive         54.0  25.685185  8.072371  14.0  23.0  25.0  26.0  46.0
Banking           112.0  26.741071  5.232858  13.0  24.0  25.0  27.0  48.0
Biotech            57.0  23.912281  7.973798  13.0  17.0  24.0  26.0  44.0
BusinessServices  163.0  26.398773  7.272933  14.0  23.0  25.0  27.0  47.0
Chemicals          62.0  21.629032  6.574251  13.0  16.0  23.0  25.0  43.0
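
To read off the oldest and youngest occupations without scanning the whole table, the per-occupation summary can be sorted by mean age. A minimal sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up rows standing in for blogstats.
toy = pd.DataFrame({
    'age': [17, 16, 18, 30, 31, 25, 26],
    'occupation': ['Student'] * 3 + ['Construction'] * 2 + ['Arts'] * 2,
})

# Count and mean age per occupation, oldest occupations first.
by_occupation = (toy.groupby('occupation')['age']
                    .agg(['count', 'mean'])
                    .sort_values('mean', ascending=False))
print(by_occupation)
```

Keeping the count column next to the mean makes it easy to spot small categories whose averages shouldn't be over-interpreted.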
