# Speaking about family: differences in speech patterns across demographics 

This notebook contains all of the analyses for Group 13's final project in Linguistics 1. 

## Project Proposal

1. **Issue/Behavior of interest** We are interested in exploring how conversations about family differ across gender, age, and other demographics. Perhaps there is a gender difference in the types of words surrounding family conversations, or perhaps the gender makeup of the conversation (man-to-man vs. man-to-woman vs. woman-to-woman) results in significant differences in the words used. These are the general questions we are interested in pursuing.

2. **Research Question** Are there any notable differences in the keywords or the sentiment used by male speakers and female speakers in conversations about family?

3. **Proposed Methods** Since we are specifically interested in conversations about family, we plan to preprocess the data to select only conversations with a listed “Theme/Topic” of Family, or conversations where family-related keywords could appear. Within this subset of the data, we plan to compare 1) the frequency of particular content words and 2) the sentiment of content words across different speaker populations (male, female, young, old). Comparing the frequency of content words will be straightforward; sentiment analysis will be slightly more complicated, but open source sentiment analysis software (NLTK) will simplify the task immensely. As a first pass, we’ll look at whether female or male speakers speak more or less positively when talking about family.

4. **Further Questions** Do the speaker’s content words (either their frequency or their sentiment) depend on the listener? Are there interaction effects between the age and sex of a speaker on the sentiment of their speech?


## Data Preprocessing

The raw data contains 17683 entries. Each entry corresponds to a single word, together with a number of features describing the setting in which the word was spoken:
* Cafe Name
* Time of Day
* Speaker Age and Gender
* Listener 1 and 2 Age and Gender
* Theme/Topic
* Formality (Informal, Neutral, Formal)

### Filtering Theme/Topic
For our study, we are interested solely in entries about family. Thus, we filtered the entries to include only those whose 'Theme/Topic' column contained the word 'Family'.

Of these remaining entries, one conversation was labeled as 'General discussion among family'; we discarded all such entries because we are interested in conversations *about* family, not just *among* family.
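
The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the exact code we ran: the rows are made up, the case-insensitive match is an assumption, and only the 'Theme/Topic' column name and the 'General discussion among family' label come from the data.

```python
import pandas as pd

# Hypothetical rows standing in for the raw data; only the 'Theme/Topic'
# column name and the 'General discussion among family' label are real.
raw = pd.DataFrame({
    'Content Word': ['mother', 'dinner', 'latte', 'brother'],
    'Theme/Topic': ['Family', 'General discussion among family',
                    'Which coffee shops are better', 'Family vacation'],
})

# Treat missing topics as empty strings so the string tests below are safe.
topic = raw['Theme/Topic'].fillna('')

# Keep topics that mention family (case-insensitive match is an assumption) ...
mentions_family = topic.str.contains('family', case=False)
# ... but drop the conversation that is merely *among* family.
among_family = topic.str.strip().str.lower() == 'general discussion among family'

family_data = raw[mentions_family & ~among_family]
print(family_data['Content Word'].tolist())
```

Building the two boolean masks separately keeps the "about family" rule and the "among family" exclusion easy to audit independently.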

### Fixing Ages
Many of the recorded ages included ranges rather than single numbers (e.g. '18-25', '30s'). In these cases, we approximate these ages with the mean of the range.
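
As a rough sketch of this conversion, the helper below handles the two formats mentioned above. The function name `approximate_age` is hypothetical, and reading a decade such as '30s' as the range 30-39 is an assumption on our part.

```python
import re

def approximate_age(raw_age):
    """Return a single numeric age for entries like '45', '18-25', or '30s'."""
    text = str(raw_age).strip()

    # A range such as '18-25': use the mean of the two endpoints.
    match = re.fullmatch(r'(\d+)\s*-\s*(\d+)', text)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        return (low + high) / 2

    # A decade such as '30s': assumed to mean 30-39, whose mean is 34.5.
    match = re.fullmatch(r'(\d+)s', text)
    if match:
        return int(match.group(1)) + 4.5

    # Anything else is expected to be a plain number; float() raises
    # ValueError for entries that are not, so they are not silently kept.
    return float(text)

print(approximate_age('18-25'), approximate_age('30s'), approximate_age(45))
```

Entries that match none of the patterns raise an error rather than being guessed at, so unusual age formats surface during cleaning.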

### Phrases to words
Despite clear instructions, some entries in the 'Content Words (one word, not phrases)' column contain phrases rather than single words. We limit our analysis to entries with exactly one word in this column.
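
One simple way to express the single-word rule is to count whitespace-separated tokens. The helper name `is_single_word` and the sample entries are hypothetical; this is a sketch of the idea rather than the exact cleaning code.

```python
def is_single_word(entry):
    """True when the entry is a string holding exactly one whitespace-delimited token."""
    return isinstance(entry, str) and len(entry.split()) == 1

# Hypothetical column values, including a phrase, padding, and missing data.
entries = ['Milk', 'lactose free', ' Quality ', '', None]
single_words = [e.strip() for e in entries if is_single_word(e)]
print(single_words)
```

Under this rule a hyphenated compound counts as one word, since it contains no whitespace. With pandas, the same helper could be applied to the column with `.apply(is_single_word)` to build a row mask.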

The CSV with the clean data can be found [here](all_clean_data.csv). It only contains columns with Speaker Age and Gender, Listener 1 Age and Gender, Theme/Topic (with an additional logical column indicating whether it is about 'family'), and the content word. 

## Loading the Data

Now that we have the data in a clean CSV file, we need to load it into a format that we can use in Python for further analysis. We'll use [pandas](http://pandas.pydata.org/), a popular tool for handling data in Python.

In [80]:
import pandas
all_data = pandas.read_csv('all_clean_data.csv')

Here's the top of our data table:

In [52]:
all_data.head()

Unnamed: 0,Speaker Age,Speaker Gender,Listener 1 Age,Listener 1 Gender,Content Word,Theme/Topic,Is about Family
0,45,male,15,female,Allergic,Allergies,0
1,45,male,15,female,Reactions,Allergies,0
2,45,male,15,female,Lactose,Allergies,0
3,45,male,15,female,Free,Allergies,0
4,45,male,15,female,Milk,Allergies,0


## Sentiment Analysis

Our project makes use of 'sentiment analysis', a technique in natural language processing that evaluates the sentiment of words, sentences, or other bodies of text. There are many approaches to sentiment analysis (**CITATIONS**).

### SentiWordNet
**ISHAN**, fill this in with details about how sentiwordnet is trained, the underlying model, etc. and cite the source code

### Generating Sentiment Scores

Now, we generate the sentiment scores for all of the content words. Once we have the score for each word, we'll look at differences in sentiment between populations (across age and gender) when talking about family.

A note about these scores: because a word can have several senses, each with its own positive, negative, and objective scores, we use the mean of each score across all of the word's senses.
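
To make the averaging concrete, here is a small sketch that uses made-up per-sense scores rather than real SentiWordNet values. The function name `mean_sense_scores` is hypothetical. In SentiWordNet the three scores of a single sense sum to 1, so the three averages do as well.

```python
def mean_sense_scores(sense_scores):
    """Average (positive, negative, objective) triples across a word's senses."""
    n = len(sense_scores)
    if n == 0:
        return None  # no senses found: the word has no sentiment data
    pos = sum(s[0] for s in sense_scores) / n
    neg = sum(s[1] for s in sense_scores) / n
    obj = sum(s[2] for s in sense_scores) / n
    return pos, neg, obj

# Hypothetical scores for a word with three senses (not real SentiWordNet values).
senses = [(0.5, 0.0, 0.5), (0.25, 0.125, 0.625), (0.0, 0.0, 1.0)]
pos, neg, obj = mean_sense_scores(senses)
print(pos, neg, obj)
```

One consequence of this choice is that a word with one strongly positive sense and many neutral senses ends up with a low positive score, because every sense is weighted equally.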

In [81]:
from os import listdir
from os.path import isfile, join, getsize
from collections import defaultdict
from collections import Counter
from os import getcwd
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
import numpy as np
import time

# Assumed definition (not shown in the cells above): one content word per row
# of the cleaned data, so the score arrays line up with all_data.
content_word_list = all_data['Content Word'].tolist()

# Scores default to -1 for words that have no SentiWordNet entry.
has_sentiment = np.zeros(len(content_word_list))
content_word_positive_scores = -np.ones(len(content_word_list))
content_word_negative_scores = -np.ones(len(content_word_list))
content_word_objective_scores = -np.ones(len(content_word_list))

for i, word in enumerate(content_word_list):
    # Materialize the senses as a list so they can be tested for emptiness
    # and iterated more than once.
    synsets = list(swn.senti_synsets(word))
    if synsets:
        has_sentiment[i] = 1
        # Average each score over all senses of the word.
        content_word_positive_scores[i] = np.mean([syns.pos_score() for syns in synsets])
        content_word_negative_scores[i] = np.mean([syns.neg_score() for syns in synsets])
        content_word_objective_scores[i] = np.mean([syns.obj_score() for syns in synsets])
    if i % 1000 == 0:
        print('Processing word: ', i)

all_data['Positive Score'] = content_word_positive_scores
all_data['Negative Score'] = content_word_negative_scores
all_data['Objective Score'] = content_word_objective_scores

Processing word:  0
Processing word:  1000
Processing word:  2000
Processing word:  3000
Processing word:  4000
Processing word:  5000
Processing word:  6000
Processing word:  7000
Processing word:  8000
Processing word:  9000
Processing word:  10000
Processing word:  11000
Processing word:  12000
Processing word:  13000
Processing word:  14000
Processing word:  15000


In [86]:
print('Total number of words: ', len(content_word_list))
print('Number of words with sentiment data: ', sum(has_sentiment))
# Keep only the rows whose content word has at least one SentiWordNet sense.
sentiment_data = all_data[has_sentiment == 1]
sentiment_data.tail()

Total number of words:  15501
Number of words with sentiment data:  14113.0


Unnamed: 0,Speaker Age,Speaker Gender,Listener 1 Age,Listener 1 Gender,Content Word,Theme/Topic,Is about Family,Positive Score,Negative Score,Objective Score
15495,30,male,35,female,like,weekend plans,0,0.284091,0.022727,0.693182
15496,30,male,35,female,phone,weekend plans,0,0.0,0.0,1.0
15498,40,female,40,female,Latte,Which coffee shops are better,0,0.0,0.0,1.0
15499,40,female,40,female,Local,Which coffee shops are better,0,0.0,0.0,1.0
15500,40,female,40,female,Quality,Which coffee shops are better,0,0.357143,0.0,0.642857
