# Demos for analyzing World Color Survey (WCS)

COG 260: Data, Computation, and The Mind (Yang Xu)

Data source: http://www1.icsi.berkeley.edu/wcs/data.html

______________________________________________

Import helper function file for WCS data analysis.

In [None]:
from wcs_helper_functions import *

Import relevant Python libraries.

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from random import random
%matplotlib inline

## Demo 3: Import color naming data
    
> Each of the 330 color chips was named by speakers of 110 different languages.

______________________________________________

Load naming data. 

`namingData` is a hierarchical dictionary organized as follows:

**language _(1 - 110)_ &rarr; speaker _(1 - *range varies per language*)_ &rarr; chip index _(1 - 330)_ &rarr; color term**
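
As a minimal sketch of how this nesting is traversed, the toy dictionary below mimics the language &rarr; speaker &rarr; chip index &rarr; color term layout. The name `toy_naming` and every term in it are made up for illustration and are not WCS data. With the real data, the same pattern would be `namingData[1][1]` for language 1, speaker 1.

```python
# Toy stand-in for namingData (hypothetical values, not real WCS terms):
# language -> speaker -> chip index -> color term
toy_naming = {
    1: {
        1: {1: "LB", 2: "LB", 3: "G"},
        2: {1: "LB", 2: "F", 3: "G"},
    },
    2: {
        1: {1: "A", 2: "A", 3: "A"},
    },
}

# All chip -> term entries for language 1, speaker 1
speaker_terms = toy_naming[1][1]

# Term this speaker gave to chip 3
term_for_chip_3 = toy_naming[1][1][3]

# Number of distinct terms this speaker used
n_unique_terms = len(set(speaker_terms.values()))

print(speaker_terms)
print(term_for_chip_3, n_unique_terms)
```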

In [None]:
namingData = readNamingData('term.txt')

For example, to count how many distinct color terms each speaker of each language used across all 330 color chips:

In [None]:
unique_colour_list = []  # number of distinct color terms used by each speaker

for val in namingData:           # each language
    for s in namingData[val]:    # each speaker of that language
        unique_colour_list.append(len(set(namingData[val][s].values())))

In [None]:
#unique_colour_list

## Demo 5: Import speaker demographic information

> Most speakers' age _(integer)_ and gender _(M/F)_ information was recorded.

______________________________________________

Load speaker information.

`speakerInfo` is a hierarchical dictionary organized as follows:

**language &rarr; speaker &rarr; (age, gender)**
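
The indexing used in this notebook (`speakerInfo[val][s][0][1]` for gender, `[0][0]` for age) suggests that each speaker maps to a one-element list holding an (age, gender) tuple. The sketch below assumes that layout; `toy_speakers` and its values are hypothetical.

```python
# Toy stand-in for speakerInfo (hypothetical values):
# language -> speaker -> [(age, gender)]
toy_speakers = {
    1: {1: [(90, "M")], 2: [(26, "F")]},
    2: {1: [(38, "F")]},
}

# Unpack the single (age, gender) record for language 1, speaker 2
age, gender = toy_speakers[1][2][0]

# Number of speakers recorded for each language
speakers_per_language = {lang: len(spk) for lang, spk in toy_speakers.items()}

print(age, gender)
print(speakers_per_language)
```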

In [None]:
speakerInfo = readSpeakerData('spkr-lsas.txt')

In [None]:
gender_age = []  # (gender, age) pair for each speaker
gender = []      # gender code for each speaker
age = []         # age for each speaker

speaker = []  # total number of speakers in each language

for val in speakerInfo:
    speaker.append(len(speakerInfo[val]))
    for s in speakerInfo[val]:
        record = speakerInfo[val][s][0]  # (age, gender) for this speaker
        gender_age.append((record[1], record[0]))
        gender.append(record[1])
        age.append(record[0])

In [None]:
# Language labels '1' through '110', one per language
s = [str(i) for i in range(1, 111)]

# Repeat each language label once per speaker of that language
nested = [[label] * n for label, n in zip(s, speaker)]


In [None]:
# Flatten into one language label per speaker
language = [item for elem in nested for item in elem]

In [None]:
full_df = pd.DataFrame(
    list(zip(language, gender, age, gender_age, unique_colour_list)),
    columns=['language', 'gender', 'age', 'gender_age', 'unique'],
)
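
Zipping parallel lists assumes that `namingData` and `speakerInfo` iterate over languages and speakers in exactly the same order. A more defensive sketch, assuming both dictionaries share language and speaker keys, looks each speaker up by key instead. The toy dictionaries and the `build_rows` helper are hypothetical and not part of the helper file.

```python
def build_rows(naming, speakers):
    """Return one record per speaker present in both dictionaries."""
    rows = []
    for lang, lang_speakers in speakers.items():
        for spk, info in lang_speakers.items():
            # Skip speakers with no naming data rather than misaligning rows
            if lang not in naming or spk not in naming[lang]:
                continue
            age, gender = info[0]
            rows.append({
                "language": str(lang),
                "speaker": spk,
                "gender": gender,
                "age": age,
                "unique": len(set(naming[lang][spk].values())),
            })
    return rows


# Toy data (hypothetical): speaker 3 has demographics but no naming data
toy_naming = {1: {1: {1: "LB", 2: "G"}, 2: {1: "LB", 2: "LB"}}}
toy_speakers = {1: {1: [(90, "M")], 2: [(26, "F")], 3: [(40, "F")]}}

rows = build_rows(toy_naming, toy_speakers)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would give a table with `language`, `speaker`, `gender`, `age`, and `unique` columns.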

In [None]:
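# Drop rows whose gender code is '*' or 'X'; copy so later edits do not touch full_df
clean = full_df[~full_df.gender.isin(['*', 'X'])].copy()
# Normalize lowercase 'f' to 'F' (use .loc to avoid chained assignment)
clean.loc[clean.gender == 'f', 'gender'] = 'F'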

clean['gender'].describe()

In [None]:
clean.head(10)

In [None]:
clean['unique'].describe()

In [None]:
# Create a subplot for each language and compare the fitted linear regression lines across the 110 languages.
# Find how the number of speakers of each gender varies across languages.
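
One possible approach to the gender tally, sketched on made-up records: count gender codes per language with `collections.Counter`. The `records` list and the `gender_counts_by_language` helper are hypothetical.

```python
from collections import Counter, defaultdict


def gender_counts_by_language(records):
    """Map each language label to a Counter of gender codes."""
    counts = defaultdict(Counter)
    for language, gender in records:
        counts[language][gender.upper()] += 1  # treat 'f' and 'F' alike
    return dict(counts)


# Toy (language, gender) pairs, not real WCS data
records = [("1", "M"), ("1", "F"), ("1", "f"), ("2", "F")]
counts = gender_counts_by_language(records)
print(counts)
```

With the real data, the pairs could come from `zip(clean['language'], clean['gender'])`.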

In [None]:
'''
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(30, 20), dpi=80, facecolor='w', edgecolor='k')

#sns.scatterplot(x="age", y="unique", data=clean)

#plt.scatter(clean['age'], clean['unique'])
'''

In [None]:
'''
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(30, 20), dpi=80, facecolor='w', edgecolor='k')

slopes = np.array([])  # fitted slope for each language

# Loop over the languages
for s in range(1, 111):

    # Age vs. number of unique color terms within language s
    subset = clean[clean['language'] == str(s)]
    x = pd.to_numeric(subset['age'], errors='coerce')  # non-numeric ages become NaN
    y = subset['unique']
    valid = x.notna()
    x, y = x[valid], y[valid]
    if len(x) < 2:
        continue  # too few speakers to fit a line

    # Fit a line: m = slope, b = intercept
    m, b = np.polyfit(x, y, 1)

    # Record the slope for this language
    slopes = np.append(slopes, m)

    # Create a subplot for this language (a 10 x 11 grid holds all 110)
    plt.subplot(10, 11, s)
    plt.title('language ' + str(s))

    # Scatter plot of unique color terms (y-axis) against age (x-axis)
    plt.plot(x, y, 'ro')

    # Overlay the fitted line on the scatter plot
    plt.plot(x, m*x + b)
'''

In [None]:
#three scatter subplots to 
#visualize the gender distribution across languages, 
#gender vs number of unique colour names, 
#age vs number of unique colour names 
