## Working with Tweets
#### Or literally any list of dictionaries or other nested data structure

Here are some tweet texts. In a5, you'll write functions that take in tweets and split them into lists of words; those word lists will be stored in dictionaries.

Sort of like this: 

In [51]:
tweets = [
    {'words': ['this', 'is', 'my', 'first', 'tweet']},
    {'words': ['did', 'you', 'hear', 'about', 'CardiB']},
    {'words': ['OMG', 'yassssss']}
]

In [52]:
##Print every tweet in tweets
for tweet in tweets: 
    print(tweet)

{'words': ['this', 'is', 'my', 'first', 'tweet']}
{'words': ['did', 'you', 'hear', 'about', 'CardiB']}
{'words': ['OMG', 'yassssss']}


#### The question here is how many words are in all my tweets. 
You know how to do this with a loop. 

Here, I loop over all the tweets and print the count of words in each tweet's word list using the built-in `len()` function.

In [53]:
for tweet in tweets: 
    print(len(tweet['words']))

5
5
2


Excellent! My computer is computing!

Now I just want to sum all of those individual counts into one count value. 

In [54]:
count = 0
for tweet in tweets: 
    count += len(tweet['words'])
count

12

#### Reducing

That's it! 

This work of smashing all these values into one value is known as *reducing*.

If you want a "reducing function", you just turn the body of the loop into a function, here called `combine()`, that takes the running total and the next item and returns the new total.

In [55]:
def combine(total, item): 
    total = total + len(item['words'])
    return total


##Looping Option: 
##The "easy" way
count = 0
for tweet in tweets: 
    count = combine(count, tweet)
print(count)


12


In [56]:
##Reducing Function Option: 
#The "hard" (but more compact) way: 

##Or just use the reduce function from functools

#remember these three lines? 
    # count = 0
    # for tweet in tweets: 
    #     count = combine(count, tweet)
    
#You can cut them down to one!


##Recall: reduce(function_to_call, list_to_apply_function_to, optional_starting_value)
from functools import reduce

reduced_count = reduce(combine, tweets, 0) #0 is the starting point, equivalent to count = 0

print(reduced_count)

12
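For a plain sum like this one, there are a couple of other spellings of the same reduction. The sketch below assumes the same `tweets` list as above; the names `lambda_count` and `total_words` are just for illustration.

```python
from functools import reduce

tweets = [
    {'words': ['this', 'is', 'my', 'first', 'tweet']},
    {'words': ['did', 'you', 'hear', 'about', 'CardiB']},
    {'words': ['OMG', 'yassssss']},
]

# reduce with an inline lambda instead of a named combine() function
lambda_count = reduce(lambda total, tweet: total + len(tweet['words']), tweets, 0)

# sum() over a generator expression does the same job when the reduction is simple addition
total_words = sum(len(tweet['words']) for tweet in tweets)

print(lambda_count, total_words)
```

`reduce()` earns its keep when the combining step is something other than addition; for simple totals, `sum()` is usually easier to read.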


Here's a tip on how you can store information in dictionaries: just set the key equal to some value.

In [57]:
some_dict = {}
print(some_dict)

{}


In [58]:
#now add some key and some value
some_dict['words'] = ['cat', 'dog', 'guinea pig']

In [59]:
print(some_dict)

{'words': ['cat', 'dog', 'guinea pig']}


In [60]:
#now add some more things to some_dict
some_dict['word_len'] = len(some_dict['words'])

In [61]:
print(some_dict)

{'words': ['cat', 'dog', 'guinea pig'], 'word_len': 3}


At a very basic level, do you see how you can add items to a dictionary?

Now, let's try something only-a-little-more complicated

## Appending to Existing Data

Later in the assignment you might be wondering, like, do I need to create a new data structure...or what? Or can I just append some information to an existing structure (like a tweet)? If your data structure is a list of dictionaries, and you're interrogating each dictionary, there's no reason not to just tack more info onto the existing one.

In [63]:
#Let's say I have some data that looks like this: 

matt_data = {'created_at': 'Mon Oct 10 18:39:51 +0000 2016',
 'entities': {'favorites': [{'hobby': 'biking',
    'text': 'I really enjoy biking'}, {'hobby': 'reading',
    'text': 'I love me a good book'}]},
 'user': {'screen_name': 'Matt'},
 'quote': 'You know, they do not pay you to juggle one ball'}

matt_data

{'created_at': 'Mon Oct 10 18:39:51 +0000 2016',
 'entities': {'favorites': [{'hobby': 'biking',
    'text': 'I really enjoy biking'},
   {'hobby': 'reading', 'text': 'I love me a good book'}]},
 'user': {'screen_name': 'Matt'},
 'quote': 'You know, they do not pay you to juggle one ball'}

In [64]:
#I can access certain interesting elements: 
matt_data['entities']['favorites']

#hmmm... that's interesting

[{'hobby': 'biking', 'text': 'I really enjoy biking'},
 {'hobby': 'reading', 'text': 'I love me a good book'}]

In [65]:
#I can create a list of hobbies by accessing the hobby elements

hobby_list = []
for hobby in matt_data['entities']['favorites']: 
    hobby_list.append(hobby['hobby'])
    
hobby_list

['biking', 'reading']

If you can do it with a loop like this one, you can do it with a list comprehension.

In [66]:
# I don't just want to create a list of hobbies though, I want to add it to my dictionary
# as an element in the dictionary
matt_data['hobbies'] = [hobby['hobby'] for hobby in matt_data['entities']['favorites']]
matt_data

{'created_at': 'Mon Oct 10 18:39:51 +0000 2016',
 'entities': {'favorites': [{'hobby': 'biking',
    'text': 'I really enjoy biking'},
   {'hobby': 'reading', 'text': 'I love me a good book'}]},
 'user': {'screen_name': 'Matt'},
 'quote': 'You know, they do not pay you to juggle one ball',
 'hobbies': ['biking', 'reading']}

See how I've actually appended my hobbies onto my original data structure?

Now imagine I had not one, but many similar data objects, I could set it up to do this kind of work on a larger scale. 
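As a rough sketch of what "larger scale" could look like: wrap the step in a function and loop over a list of records. The `people` list, the second record, and the `add_hobbies()` helper are all made up for illustration; they only assume records shaped like `matt_data`.

```python
# Hypothetical records shaped like matt_data (only the parts needed here)
people = [
    {'user': {'screen_name': 'Matt'},
     'entities': {'favorites': [{'hobby': 'biking'}, {'hobby': 'reading'}]}},
    {'user': {'screen_name': 'Ana'},
     'entities': {'favorites': [{'hobby': 'chess'}]}},
]

def add_hobbies(record):
    """Attach a flat list of hobbies to the record itself (modifies it in place)."""
    record['hobbies'] = [fav['hobby'] for fav in record['entities']['favorites']]
    return record

# Same work as before, once per record
for person in people:
    add_hobbies(person)

print([(p['user']['screen_name'], p['hobbies']) for p in people])
```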

Here's another example: 

In [67]:
matt_data['hobbies_cnt'] = len(matt_data['hobbies'] )
matt_data

{'created_at': 'Mon Oct 10 18:39:51 +0000 2016',
 'entities': {'favorites': [{'hobby': 'biking',
    'text': 'I really enjoy biking'},
   {'hobby': 'reading', 'text': 'I love me a good book'}]},
 'user': {'screen_name': 'Matt'},
 'quote': 'You know, they do not pay you to juggle one ball',
 'hobbies': ['biking', 'reading'],
 'hobbies_cnt': 2}

You don't have to do this in multiple cells either; you can group this stuff into a few lines.

In [68]:
##Helper function that splits strings
import re
def split_string(text): 
    return re.split(r'\W+', text)


In [69]:
matt_data['quote_words'] = split_string(matt_data['quote'])
matt_data['words_cnt'] = len(matt_data['quote_words'])
matt_data

{'created_at': 'Mon Oct 10 18:39:51 +0000 2016',
 'entities': {'favorites': [{'hobby': 'biking',
    'text': 'I really enjoy biking'},
   {'hobby': 'reading', 'text': 'I love me a good book'}]},
 'user': {'screen_name': 'Matt'},
 'quote': 'You know, they do not pay you to juggle one ball',
 'hobbies': ['biking', 'reading'],
 'hobbies_cnt': 2,
 'quote_words': ['You',
  'know',
  'they',
  'do',
  'not',
  'pay',
  'you',
  'to',
  'juggle',
  'one',
  'ball'],
 'words_cnt': 11}
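One thing to watch with `re.split(r'\W+', ...)`: if the text starts or ends with punctuation, you get an empty string in the list, which throws off the word count. The quote above has no trailing punctuation, so it came out clean. Here is a small sketch of the issue; `split_words()` is a hypothetical alternative, not part of the assignment.

```python
import re

def split_string(text):
    # Same helper as above: split on runs of non-word characters
    return re.split(r'\W+', text)

def split_words(text):
    # Hypothetical variant: collect runs of word characters, so no empty strings
    return re.findall(r'\w+', text)

with_period = 'They do not pay you to juggle one ball.'

print(split_string(with_period))  # the last element is an empty string
print(split_words(with_period))   # only the actual words
```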

## Pandas

I'll introduce two data structures: 
* Series
* DataFrames


## Series

In [31]:
import pandas as pd #python data analysis library
import numpy as np #num is short for numerical

In [32]:
# create a Series from a list
number_series = pd.Series([1, 2, 2, 3, 5, 8])
print(number_series)

0    1
1    2
2    2
3    3
4    5
5    8
dtype: int64


In [47]:
# create a Series from a dictionary
ages = pd.Series({'sarah':42, 'amit':35, 'zhang':13})
ages

sarah    42
amit     35
zhang    13
dtype: int64

See how this looks a little like a table? 

In [39]:
#Series Operation
results = number_series + 4
print(results)

0     5
1     6
2     6
3     7
4     9
5    12
dtype: int64


In [41]:
greater_than = number_series >2
print(greater_than)

##This is really useful for applying changes without having to loop

0    False
1    False
2    False
3     True
4     True
5     True
dtype: bool


In [48]:
#Can still add pairwise
ages + pd.Series({'sarah':10, 'amit':5, 'zhang':3})

sarah    52
amit     40
zhang    16
dtype: int64

In [49]:
ages + pd.Series({'sarah':10, 'amit':5, 'zhang':3, 'anja': 12})

##notice 2 things: NaN ('anja' is not in ages, so there is nothing to add) and ordering (the labels come back sorted)

amit     40.0
anja      NaN
sarah    52.0
zhang    16.0
dtype: float64
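What's going on: pandas lines the two Series up by label before adding, and any label that appears in only one of them gets `NaN`. If you would rather treat a missing label as zero, `Series.add()` takes a `fill_value`. A minimal sketch, assuming the same `ages` Series; `bonus` is just an illustrative name for the second Series.

```python
import pandas as pd

ages = pd.Series({'sarah': 42, 'amit': 35, 'zhang': 13})
bonus = pd.Series({'sarah': 10, 'amit': 5, 'zhang': 3, 'anja': 12})

plain = ages + bonus                    # 'anja' has no partner in ages, so NaN
filled = ages.add(bonus, fill_value=0)  # a missing label counts as 0 instead

print(plain)
print(filled)
```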

In [54]:
ages_mean = ages.mean()
ages_mean

30.0

In [55]:
ages.index

Index(['sarah', 'amit', 'zhang'], dtype='object')

In [57]:
ages.head(2)

sarah    42
amit     35
dtype: int64

In [62]:
s1 = pd.Series([1, 2, 2, 3, 5, 8])
s2 = pd.Series([11, 21, 21, 31, 1, 81])
(s1 < s2).all()

#So it's not true that all elements of s1 are less than s2
# Recall, all we need is one counter-example

False
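To actually find the counter-example, you can flip the boolean Series with `~` and use it as a filter. This sketch reuses `s1` and `s2` from the cell above; `is_less` and `counter_examples` are illustrative names.

```python
import pandas as pd

s1 = pd.Series([1, 2, 2, 3, 5, 8])
s2 = pd.Series([11, 21, 21, 31, 1, 81])

is_less = s1 < s2

print(is_less.any())  # is s1 < s2 in at least one position?
print(is_less.all())  # is s1 < s2 in every position?

# ~ flips True/False, so this keeps only the positions where s1 is NOT less than s2
counter_examples = s1[~is_less]
print(counter_examples)
```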

In [64]:
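print(ages)
ages.iloc[0]  # first element by position; plain ages[0] on a label index is deprecated in recent pandas versions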

sarah    42
amit     35
zhang    13
dtype: int64


42
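When a Series has labels like names, there are two ways to look something up: by position or by label. A short sketch with the same `ages` Series (the variable names on the left are just for illustration):

```python
import pandas as pd

ages = pd.Series({'sarah': 42, 'amit': 35, 'zhang': 13})

first_by_position = ages.iloc[0]    # .iloc is always positional
sarah_by_label = ages.loc['sarah']  # .loc is always label-based
amit_by_label = ages['amit']        # plain brackets with a label also work

print(first_by_position, sarah_by_label, amit_by_label)
```

Being explicit with `.iloc` and `.loc` avoids guessing whether a number means "position 0" or "the label 0".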

In [67]:
#multiple indices

#We can access with bracket notation
#within that we put a list
#s1 []
#s1 [ [index_list] ]

s1[[0, 3, 5]]

0    1
3    3
5    8
dtype: int64

In [73]:
#Boolean indexing
filter_indices = [True, False, False, True, False, True]
s1[filter_indices]

0    1
3    3
5    8
dtype: int64

In [76]:
shoe_sizes = pd.Series([5.5, 11, 7, 8, 4])
shoe_sizes

0     5.5
1    11.0
2     7.0
3     8.0
4     4.0
dtype: float64

In [77]:
small_sizes = shoe_sizes < 6  # True, False, False, False, True
small_sizes

0     True
1    False
2    False
3    False
4     True
dtype: bool

In [79]:
small_shoes = shoe_sizes[small_sizes]  # has values 5.5, 4
small_shoes

0    5.5
4    4.0
dtype: float64

In [80]:
# as one line
small_shoes = shoe_sizes[shoe_sizes < 6]

small_shoes
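You can also combine conditions in a filter. With pandas you use `&` (and) and `|` (or) instead of the Python keywords, and each condition needs its own parentheses. A sketch using the same `shoe_sizes`; `mid_sizes` and `extreme_sizes` are illustrative names.

```python
import pandas as pd

shoe_sizes = pd.Series([5.5, 11, 7, 8, 4])

# Both conditions must hold: bigger than 5 AND smaller than 8
mid_sizes = shoe_sizes[(shoe_sizes > 5) & (shoe_sizes < 8)]

# Either condition may hold: smaller than 5 OR bigger than 10
extreme_sizes = shoe_sizes[(shoe_sizes < 5) | (shoe_sizes > 10)]

print(mid_sizes)
print(extreme_sizes)
```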

## DataFrames

In [82]:
name_series = pd.Series(['Ada','Bob','Chris','Diya','Emma'])
heights = [64, 74, 69, 69, 71] #Straight list or a series
weights = [135, 156, 139, 144, 152]

df = pd.DataFrame({'name':name_series, 
                   'height':heights, 
                   'weight':weights})
print(df)

    name  height  weight
0    Ada      64     135
1    Bob      74     156
2  Chris      69     139
3   Diya      69     144
4   Emma      71     152


Look at how nicely that formats...

This is one goal of programming. You spent an hour getting a5 to format that nicely. Someone else wrote a program that does it in seconds. 
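This ties back to the list-of-dictionaries idea from earlier: `pd.DataFrame()` also accepts a list of dictionaries, one per row. A minimal sketch using the toy `tweets` list; `tweet_df` and the `word_cnt` column are illustrative names.

```python
import pandas as pd

tweets = [
    {'words': ['this', 'is', 'my', 'first', 'tweet']},
    {'words': ['did', 'you', 'hear', 'about', 'CardiB']},
    {'words': ['OMG', 'yassssss']},
]

# Each dictionary becomes a row; each key becomes a column
tweet_df = pd.DataFrame(tweets)

# Apply len() to every word list to get a per-tweet count
tweet_df['word_cnt'] = tweet_df['words'].apply(len)

print(tweet_df)
print(tweet_df['word_cnt'].sum())
```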

In [89]:
##Now look if you don't print()
df

|   | name | height | weight |
|---|------|--------|--------|
| 0 | Ada | 64 | 135 |
| 1 | Bob | 74 | 156 |
| 2 | Chris | 69 | 139 |
| 3 | Diya | 69 | 144 |
| 4 | Emma | 71 | 152 |


In [83]:
df.index

RangeIndex(start=0, stop=5, step=1)

In [85]:
df.columns

Index(['name', 'height', 'weight'], dtype='object')

In [93]:
df = df.assign(shoe_sizes_size = [11,5,6,5.5,8])
df

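|   | name | height | weight | shoe_sizes_size |
|---|------|--------|--------|-----------------|
| 0 | Ada | 64 | 135 | 11.0 |
| 1 | Bob | 74 | 156 | 5.0 |
| 2 | Chris | 69 | 139 | 6.0 |
| 3 | Diya | 69 | 144 | 5.5 |
| 4 | Emma | 71 | 152 | 8.0 |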
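Notice the `df = df.assign(...)` in that cell. `assign()` hands back a new DataFrame and leaves the original alone, so you have to store the result somewhere. Bracket assignment, by contrast, changes the DataFrame you already have. A small sketch with a made-up two-row frame (`with_shoes` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ada', 'Bob'], 'height': [64, 74]})

# assign() returns a new DataFrame; df itself is not changed
with_shoes = df.assign(shoe_size=[11, 5])
print('shoe_size' in df.columns)          # the original has no such column
print('shoe_size' in with_shoes.columns)  # the copy does

# Bracket assignment adds the column to df directly
df['weight'] = [135, 156]
print(list(df.columns))
```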


In [94]:
df.mean(numeric_only=True) ##Avg by column (numeric columns only, since 'name' is text)

height              69.4
weight             145.2
shoe_sizes_size      7.1
dtype: float64

In [98]:
#table of descriptive stats
descriptive_stats = round(df.describe(), 2)
descriptive_stats

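|       | height | weight | shoe_sizes_size |
|-------|--------|--------|-----------------|
| count | 5.0 | 5.0 | 5.0 |
| mean | 69.4 | 145.2 | 7.1 |
| std | 3.65 | 8.76 | 2.46 |
| min | 64.0 | 135.0 | 5.0 |
| 25% | 69.0 | 139.0 | 5.5 |
| 50% | 69.0 | 144.0 | 6.0 |
| 75% | 71.0 | 152.0 | 8.0 |
| max | 74.0 | 156.0 | 11.0 |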


Notice: 
These stats are computed per column, not per record (row). So the descriptive stats are "describing" columns (aka the series of data).
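If you ever do want a number per record instead of per column, most of these methods take an `axis` argument. The sketch below assumes the same heights and weights; averaging a height with a weight is not meaningful, it is only there to show the direction of the calculation.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ada', 'Bob', 'Chris', 'Diya', 'Emma'],
                   'height': [64, 74, 69, 69, 71],
                   'weight': [135, 156, 139, 144, 152]})

numeric = df[['height', 'weight']]  # leave the text column out

col_means = numeric.mean()        # default: one value per column
row_means = numeric.mean(axis=1)  # axis=1: one value per row (record)

print(col_means)
print(row_means)
```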

In [99]:
descriptive_stats['height']

count     5.00
mean     69.40
std       3.65
min      64.00
25%      69.00
50%      69.00
75%      71.00
max      74.00
Name: height, dtype: float64

In [101]:
#The Descriptive stats is a DF, so you can access it with bracket notation
descriptive_stats['height']['count']

5.0

In [106]:
#can also access with dot notation
descriptive_stats.height

count     5.00
mean     69.40
std       3.65
min      64.00
25%      69.00
50%      69.00
75%      71.00
max      74.00
Name: height, dtype: float64

In [108]:
#combine them
descriptive_stats.height['count']

5.0

In [110]:
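#but not this...
descriptive_stats.height.count

#...why? Every Series already has a method called count(), and dot notation
# finds that method before it looks for a row label named 'count'.
# When a label collides with a method name, use bracket notation instead.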

<bound method Series.count of count     5.00
mean     69.40
std       3.65
min      64.00
25%      69.00
50%      69.00
75%      71.00
max      74.00
Name: height, dtype: float64>
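The output above is a hint: it says `bound method Series.count`, meaning the dot gave us a function rather than a value. Here is a small sketch of the difference, using a one-column frame built from the same heights (`stats`, `by_label`, `as_method`, and `called` are illustrative names).

```python
import pandas as pd

df = pd.DataFrame({'height': [64, 74, 69, 69, 71]})
stats = df.describe()

by_label = stats['height']['count']  # the row labelled 'count'
as_method = stats.height.count       # the Series.count method itself, not a value
called = stats.height.count()        # calling it counts the non-null entries in the stats column

print(by_label)
print(callable(as_method))
print(called)
```

So `by_label` is the number of heights, while `called` is the number of rows in the stats table, which is a completely different question.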

In [117]:
##maybe I want to see some columns: 
df[['height', 'weight']]

|   | height | weight |
|---|--------|--------|
| 0 | 64 | 135 |
| 1 | 74 | 156 |
| 2 | 69 | 139 |
| 3 | 69 | 144 |
| 4 | 71 | 152 |


In [119]:
##Can also access rows: 
df[df.height > 70]

|   | name | height | weight | shoe_sizes_size |
|---|------|--------|--------|-----------------|
| 1 | Bob | 74 | 156 | 5.0 |
| 4 | Emma | 71 | 152 | 8.0 |


In [120]:
##Bob and Emma are tall...!

In [121]:
##So what's going on here? 

df.height >70

0    False
1     True
2    False
3    False
4     True
Name: height, dtype: bool

In [122]:
#Ok, now get me all rows w/ True
df [ df.height >70]

|   | name | height | weight | shoe_sizes_size |
|---|------|--------|--------|-----------------|
| 1 | Bob | 74 | 156 | 5.0 |
| 4 | Emma | 71 | 152 | 8.0 |
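The same boolean filter can be combined with a column choice through `.loc`, or with a second condition using `&`. A sketch assuming the same names, heights, and weights; `tall_names` and `tall_and_light` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ada', 'Bob', 'Chris', 'Diya', 'Emma'],
                   'height': [64, 74, 69, 69, 71],
                   'weight': [135, 156, 139, 144, 152]})

# Rows where height > 70, but only the 'name' column
tall_names = df.loc[df.height > 70, 'name']

# Two conditions at once; each one needs its own parentheses
tall_and_light = df[(df.height > 70) & (df.weight < 155)]

print(tall_names)
print(tall_and_light)
```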


#### DF Lookups

In [124]:
#This is like saying get me a series for the given row
#(for the row at the location I define)

descriptive_stats.loc['mean']

#So if I want to get all the avgs. that's how I could get it

height              69.4
weight             145.2
shoe_sizes_size      7.1
Name: mean, dtype: float64
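`.loc` can also take a row label and a column label together, and `.iloc` does the same kind of lookup by position. A minimal sketch, assuming the same heights and weights (`mean_height`, `min_max_weight`, and `first_row` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'height': [64, 74, 69, 69, 71],
                   'weight': [135, 156, 139, 144, 152]})
stats = df.describe()

# One cell: row label first, then column label
mean_height = stats.loc['mean', 'height']

# Several rows of one column: pass a list of row labels
min_max_weight = stats.loc[['min', 'max'], 'weight']

# By position instead of label: the first row of the original data
first_row = df.iloc[0]

print(mean_height)
print(min_max_weight)
print(first_row)
```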