# Extracting Data from Twitter CSV using Pandas

Using Pandas, we will create a DataFrame from Twitter data (`tweets.csv`).

* First, we'll write a function that returns a dictionary with languages as keys and the number of tweets written in each language as values.

* Next, we will write a more general function that processes a DataFrame and returns a dictionary with counts of occurrences in `ANY column`. By default, it will process the column containing languages.

* We will then generalize the previous function so that we can pass it a DataFrame and `ANY number of column names` to perform the computation on an `arbitrary` number of columns.


## Importing Data

In [4]:
# Importing dependencies

import pandas as pd

In [5]:
# Import Twitter data as DataFrame: tweets_df
tweets_df = pd.read_csv('tweets.csv')


## Visualizing Data

* Using the `.head()` method to preview the first rows and columns of the DataFrame:

In [6]:
tweets_df.head()

Unnamed: 0,contributors,coordinates,created_at,entities,extended_entities,favorite_count,favorited,filter_level,geo,id,...,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,text,timestamp_ms,truncated,user
0,,,Tue Mar 29 23:40:17 +0000 2016,"{'hashtags': [], 'user_mentions': [{'screen_na...","{'media': [{'sizes': {'large': {'w': 1024, 'h'...",0,False,low,,714960401759387648,...,,,0,False,"{'retweeted': False, 'text': "".@krollbondratin...","<a href=""http://twitter.com"" rel=""nofollow"">Tw...",RT @bpolitics: .@krollbondrating's Christopher...,1459294817758,False,"{'utc_offset': 3600, 'profile_image_url_https'..."
1,,,Tue Mar 29 23:40:17 +0000 2016,"{'hashtags': [{'text': 'cruzsexscandal', 'indi...","{'media': [{'sizes': {'large': {'w': 500, 'h':...",0,False,low,,714960401977319424,...,,,0,False,"{'retweeted': False, 'text': '@dmartosko Cruz ...","<a href=""http://twitter.com"" rel=""nofollow"">Tw...",RT @HeidiAlpine: @dmartosko Cruz video found.....,1459294817810,False,"{'utc_offset': None, 'profile_image_url_https'..."
2,,,Tue Mar 29 23:40:17 +0000 2016,"{'hashtags': [], 'user_mentions': [], 'symbols...",,0,False,low,,714960402426236928,...,,,0,False,,"<a href=""http://www.facebook.com/twitter"" rel=...",Njihuni me Zonjën Trump !!! | Ekskluzive https...,1459294817917,False,"{'utc_offset': 7200, 'profile_image_url_https'..."
3,,,Tue Mar 29 23:40:17 +0000 2016,"{'hashtags': [], 'user_mentions': [], 'symbols...",,0,False,low,,714960402367561730,...,7.149239e+17,7.149239e+17,0,False,,"<a href=""http://twitter.com/download/android"" ...",Your an idiot she shouldn't have tried to grab...,1459294817903,False,"{'utc_offset': None, 'profile_image_url_https'..."
4,,,Tue Mar 29 23:40:17 +0000 2016,"{'hashtags': [], 'user_mentions': [{'screen_na...",,0,False,low,,714960402149416960,...,,,0,False,"{'retweeted': False, 'text': 'The anti-America...","<a href=""http://twitter.com/download/iphone"" r...",RT @AlanLohner: The anti-American D.C. elites ...,1459294817851,False,"{'utc_offset': -18000, 'profile_image_url_http..."


In [7]:
print("Column Headers: ", list(tweets_df), sep="\n")

Column Headers: 
['contributors', 'coordinates', 'created_at', 'entities', 'extended_entities', 'favorite_count', 'favorited', 'filter_level', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'place', 'possibly_sensitive', 'quoted_status', 'quoted_status_id', 'quoted_status_id_str', 'retweet_count', 'retweeted', 'retweeted_status', 'source', 'text', 'timestamp_ms', 'truncated', 'user']


* Using the `.info()` method to check the data types, number of rows, and more:

In [8]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 31 columns):
contributors                 0 non-null float64
coordinates                  0 non-null float64
created_at                   100 non-null object
entities                     100 non-null object
extended_entities            20 non-null object
favorite_count               100 non-null int64
favorited                    100 non-null bool
filter_level                 100 non-null object
geo                          0 non-null float64
id                           100 non-null int64
id_str                       100 non-null int64
in_reply_to_screen_name      11 non-null object
in_reply_to_status_id        8 non-null float64
in_reply_to_status_id_str    8 non-null float64
in_reply_to_user_id          11 non-null float64
in_reply_to_user_id_str      11 non-null float64
is_quote_status              100 non-null bool
lang                         100 non-null object
place                       

In [9]:
tweets_df[['lang']]

Unnamed: 0,lang
0,en
1,en
2,et
3,en
4,en
5,en
6,en
7,en
8,en
9,en


## Extracting the Number of Occurrences of Each Language

* We will define a function that goes through each entry in a column of a DataFrame and counts how many times each distinct entry appears.

* This function has two **`parameters`**: `df` for the DataFrame and the string `col_name` for the column name.

* We will then extract the number of times each language appears under column `'lang'` by passing to the function the **`arguments`**: `tweets_df` and `'lang'`.

In [10]:
def count_entries1(df, col_name):
    """ Returns a dictionary with counts of occurrences
        as value for each key. """

    # Initialize an empty dictionary: langs_count
    langs_count = {}

    # Extract the requested column from the DataFrame: col
    col = df[col_name]

    # Iterate over the entries in the column
    for entry in col:

        # If the language is in langs_count, add 1 to it
        if entry in langs_count:
            langs_count[entry] += 1

        # Else add the language to langs_count, set the value to 1
        else:
            langs_count[entry] = 1

    return langs_count


* Assigning the result to the variable `result` and printing the resulting dictionary:

In [11]:
result = count_entries1(tweets_df, 'lang')

print(result)

{'en': 97, 'et': 1, 'und': 2}
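
As a cross-check on the hand-written loop, the same kind of tally can be obtained with pandas' built-in `Series.value_counts()`. The sketch below uses a small made-up DataFrame (`sample_df` is a stand-in, not the real `tweets.csv` data), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the 'lang' column of tweets.csv
sample_df = pd.DataFrame({"lang": ["en", "en", "et", "und", "en"]})

# value_counts() tallies each distinct value in the column;
# to_dict() converts the resulting Series into a {value: count} dictionary
lang_counts = sample_df["lang"].value_counts().to_dict()

print(lang_counts)
```

Applied to the real data, `tweets_df['lang'].value_counts().to_dict()` should contain the same key-value pairs as `count_entries1(tweets_df, 'lang')`, although the key order may differ.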


## A more generalized function - Part 1

Now we will write a more general version of the previous function. It also processes a DataFrame, **BUT** this time it returns a dictionary with counts of occurrences in `ANY column`. The column name has a `default argument`: the column containing languages (`'lang'`).

In [12]:
def count_entries2(df, col_name='lang'):
    """ Prints the counts of occurrences in a column and
        returns them as a dictionary. """

    # Initialize an empty dictionary: cols_count
    cols_count = {}

    # Extract column from DataFrame: col
    col = df[col_name]

    # Iterate over the column in DataFrame
    for entry in col:

        # If entry is in cols_count, add 1
        if entry in cols_count:
            cols_count[entry] += 1

        # Else add the entry to cols_count, set the value to 1
        else:
            cols_count[entry] = 1

    def print_items(dic):
        """ Prints out the keys and values of a dictionary
            separated by a colon ':' """

        print("\nNUMBER OF ENTRIES IN COLUMN: '{}'\n".format(col_name.upper()))

        for key, value in dic.items():
            print("{} : {} entries".format(key, value))

    # Print the report, then return the cols_count dictionary
    print_items(cols_count)
    return cols_count


In [13]:
lang_result = count_entries2(tweets_df)
source_result = count_entries2(tweets_df, 'source')

print(lang_result)
print(source_result)


NUMBER OF ENTRIES IN COLUMN: 'LANG'

en : 97 entries
et : 1 entries
und : 2 entries

NUMBER OF ENTRIES IN COLUMN: 'SOURCE'

<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a> : 24 entries
<a href="http://www.facebook.com/twitter" rel="nofollow">Facebook</a> : 1 entries
<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> : 26 entries
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> : 33 entries
<a href="http://www.twitter.com" rel="nofollow">Twitter for BlackBerry</a> : 2 entries
<a href="http://www.google.com/" rel="nofollow">Google</a> : 2 entries
<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a> : 6 entries
<a href="http://linkis.com" rel="nofollow">Linkis.com</a> : 2 entries
<a href="http://rutracker.org/forum/viewforum.php?f=93" rel="nofollow">newzlasz</a> : 2 entries
<a href="http://ifttt.com" rel="nofollow">IFTTT</a> : 1 entries
<a href="http://www.myplume.com/" rel
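
The counting loop in this section can also be delegated to `collections.Counter` from the standard library while keeping the default argument. This is only a sketch: the function name `count_entries_counter` and the sample data are hypothetical and not part of the original notebook.

```python
from collections import Counter

import pandas as pd


def count_entries_counter(df, col_name="lang"):
    """Return a dict mapping each value in df[col_name] to its count."""
    # Counter iterates over the column and tallies every distinct value
    return dict(Counter(df[col_name]))


# Hypothetical sample data standing in for tweets.csv
sample_df = pd.DataFrame({
    "lang": ["en", "et", "en"],
    "source": ["web", "web", "ios"],
})

# No column name given: the default 'lang' column is used
default_counts = count_entries_counter(sample_df)

# Explicit column name overrides the default
source_counts = count_entries_counter(sample_df, "source")

print(default_counts)
print(source_counts)
```

Calling the function without a column name falls back to `'lang'`, which mirrors how `count_entries2(tweets_df)` is used above.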

## A more generalized function - Part 2

We will generalize the function even further by allowing the user to pass it a flexible argument (`*args`): in this case, as many column names as the user would like.


In [14]:
def count_entries3(df, *args):
    """ Prints the counts of occurrences in the given columns
        and returns them as a single dictionary. """

    # Initialize an empty dictionary: cols_count
    cols_count = {}

    # Iterate over column names in args
    for col_name in args:

        # Extract column from DataFrame: col
        col = df[col_name]

        # Iterate over the column in DataFrame
        for entry in col:

            # If entry is in cols_count, add 1
            if entry in cols_count:
                cols_count[entry] += 1

            # Else add the entry to cols_count, set the value to 1
            else:
                cols_count[entry] = 1

    # Print each entry with its count
    for key, value in cols_count.items():
        print("{} : {} entries".format(key, value))

    # Return the cols_count dictionary
    return cols_count


In [15]:
new_result = count_entries3(tweets_df, 'lang', 'source')

en : 97 entries
et : 1 entries
und : 2 entries
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a> : 24 entries
<a href="http://www.facebook.com/twitter" rel="nofollow">Facebook</a> : 1 entries
<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> : 26 entries
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> : 33 entries
<a href="http://www.twitter.com" rel="nofollow">Twitter for BlackBerry</a> : 2 entries
<a href="http://www.google.com/" rel="nofollow">Google</a> : 2 entries
<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a> : 6 entries
<a href="http://linkis.com" rel="nofollow">Linkis.com</a> : 2 entries
<a href="http://rutracker.org/forum/viewforum.php?f=93" rel="nofollow">newzlasz</a> : 2 entries
<a href="http://ifttt.com" rel="nofollow">IFTTT</a> : 1 entries
<a href="http://www.myplume.com/" rel="nofollow">Plume for Android</a> : 1 entries
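
Because `count_entries3` accumulates every column into one dictionary, a value that happened to appear in two different columns would have its counts merged. One possible variation keeps a separate dictionary per column. The function name `count_entries_by_column` and the sample data below are hypothetical, used only to illustrate the idea.

```python
import pandas as pd


def count_entries_by_column(df, *col_names):
    """Return {column name: {value: count}} for each requested column."""
    counts = {}

    for col_name in col_names:
        # Fresh dictionary for each column, so values never mix across columns
        col_counts = {}
        for entry in df[col_name]:
            # dict.get returns 0 the first time an entry is seen
            col_counts[entry] = col_counts.get(entry, 0) + 1
        counts[col_name] = col_counts

    return counts


# Hypothetical sample data standing in for tweets.csv
sample_df = pd.DataFrame({
    "lang": ["en", "et", "en"],
    "source": ["web", "web", "ios"],
})

nested_counts = count_entries_by_column(sample_df, "lang", "source")
print(nested_counts)
```

With this layout, the counts for one column can be read back with `nested_counts['lang']`, at the cost of a nested structure instead of a flat one.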


## Adding Error Handling functionality

* We write a `lambda function` and use `filter()` to select only `retweets`, that is, tweets that begin with the string `'RT'`.
* To get the first 2 characters of a tweet `x`, we use `x[0:2]` (or the equivalent `x[:2]`).
* To check equality, we compare with `==`, which produces the Boolean value that `filter()` tests for each tweet.

In [28]:
# Select retweets from the Twitter DataFrame
result = filter(lambda x: x[:2] == 'RT', tweets_df['text'])

# Create list from filter object result: res_list
res_list = list(result)

# Print all retweets in res_list
for tweet in res_list:
    print(tweet)

RT @bpolitics: .@krollbondrating's Christopher Whalen says Clinton is the weakest Dem candidate in 50 years https://t.co/pLk7rvoRSn https:/…
RT @HeidiAlpine: @dmartosko Cruz video found.....racing from the scene.... #cruzsexscandal https://t.co/zuAPZfQDk3
RT @AlanLohner: The anti-American D.C. elites despise Trump for his America-first foreign policy. Trump threatens their gravy train. https:…
RT @BIackPplTweets: Young Donald trump meets his neighbor  https://t.co/RFlu17Z1eE
RT @trumpresearch: @WaitingInBagdad @thehill Trump supporters have selective amnisia.
RT @HouseCracka: 29,000+ PEOPLE WATCHING TRUMP LIVE ON ONE STREAM!!!

https://t.co/7QCFz9ehNe
RT @urfavandtrump: RT for Brendon Urie
Fav for Donald Trump https://t.co/PZ5vS94lOg
RT @trapgrampa: This is how I see #Trump every time he speaks. https://t.co/fYSiHNS0nT
RT @trumpresearch: @WaitingInBagdad @thehill Trump supporters have selective amnisia.
RT @Pjw20161951: NO KIDDING: #SleazyDonald just attacked Scott Walker for NOT RAISI
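
The same selection can also be expressed with pandas string methods instead of `filter()` and a lambda. The sketch below runs on made-up tweets (`sample_df` is hypothetical); it uses `Series.str.startswith` with `na=False`, so a missing tweet text is treated as "not a retweet" rather than raising an error.

```python
import pandas as pd

# Hypothetical sample data; the last row simulates a missing tweet text
sample_df = pd.DataFrame({
    "text": ["RT @user: hello", "plain tweet", "RT again", None],
})

# Boolean mask: True where the text starts with 'RT', False otherwise
# (na=False maps missing values to False)
is_retweet = sample_df["text"].str.startswith("RT", na=False)

# Keep only the rows flagged as retweets and convert them to a list
retweets = sample_df.loc[is_retweet, "text"].tolist()

for tweet in retweets:
    print(tweet)
```

Like the `x[:2] == 'RT'` test, this matches any text that begins with the two characters `RT`, so it is a heuristic rather than an exact retweet check.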