**IMPORTANT:** make sure that you have the newest version of pandas (0.18 or higher) by running the following command (in the terminal):
* sudo pip3 install pandas --upgrade

# Pandas and Statsmodels
## Completing your data analysis workflow

We will use pandas and statsmodels today to show how the data analysis can be done completely in Python. 

Pandas is a package that allows us to work with datasets in a similar manner as in R (with dataframes) and, according to their own website, has the objective of becoming **the most powerful and flexible open source data analysis / manipulation tool available in any language**. 

It's up to you to decide whether that is (already) true or not, but this tutorial will demonstrate some of its capabilities. We also use statsmodels to do some of the statistical analysis, in a workflow integrated with Pandas.

## Importing the packages (or installing them, if needed)
First we import pandas (usually imported as "pd"), statsmodels and numpy. We also use matplotlib for some visualizations.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels as sm
import numpy as np

### Troubleshooting notes 

If you don't have these packages already, use pip3 to install them (using the terminal):
* sudo pip3 install numpy
* sudo pip3 install patsy
* sudo pip3 install pandas
* sudo pip3 install matplotlib
* sudo pip3 install statsmodels

If you get a gcc-error when installing any of the packages, run (in the terminal):
* sudo apt-get install python3-dev



In [None]:
%matplotlib inline

# Loading (or creating) data for Pandas

One of the key advantages of Pandas is how flexible it is in terms of types of data or files that it can load, and that it can save into. But first, we can also simply create the data from scratch.

For example, we can create a dataframe from lists.

In [None]:
names = ['John', 'Mary', 'Stefan', 'Cristina']
ages = [18, 20, 33, 19]

In [None]:
names_and_ages = list(zip(names, ages))

In [None]:
df = pd.DataFrame(data = names_and_ages, columns = ['name', 'age'])

In this case, we called the dataframe df. We could have used any other name, although many tutorials and Stack Exchange answers use df as the name of a generic example dataframe.

To see what the dataframe looks like, we can simply call it again (like any Python object).

In [None]:
df
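
As a side note, the same dataframe can also be built from a dictionary of lists, which skips the zip step. This is only a minimal sketch; the name df_from_dict is made up for the illustration.

```python
import pandas as pd

names = ['John', 'Mary', 'Stefan', 'Cristina']
ages = [18, 20, 33, 19]

# Each dictionary key becomes a column name; each list becomes that column's values.
df_from_dict = pd.DataFrame({'name': names, 'age': ages})

print(df_from_dict.shape)
print(list(df_from_dict.columns))
```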

We can also use a list of dictionaries to create a dataframe. The advantage here is that Pandas automatically recognizes the keys in the dictionaries as being the column names. This can be especially handy when working with JSON, and data acquired via APIs.

In [None]:
people = [
    {'name': 'John', 'age' : 20 , 'profession' : 'student', 'salary' : 10000},
    {'name': 'Mary', 'age' : 33 , 'profession' : 'journalist', 'salary' : 50000},
    {'name': 'Stefan', 'age' : 40, 'profession' : 'researcher', 'monthly_salary': 1500 },
    {'name': 'Cristina', 'age' : 18, 'profession' : 'webmaster', 'salary' : 25000},
    {'name': 'Joost', 'age' :22 , 'profession' : 'data scientist', 'salary' : 70000},
    {'name': 'Sandra', 'age' : 34, 'profession' : 'journalist', 'salary' : 55000},
    {'name': 'Marina', 'age' : 50, 'profession' : 'researcher', 'salary' : 45000 },
]

In [None]:
df = pd.DataFrame(people)

In [None]:
df

Note how Stefan has a NaN in the salary column, and everybody else has a NaN in the monthly_salary column. The dictionary above only had Stefan with the key "monthly_salary", while everybody else had only "salary". Pandas treated everything that it could not find as missing data (NaN, which stands for "Not a Number").
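
One possible way to deal with this kind of missing data is sketched below on a two-row sample. It assumes that salary is an annual figure, so the monthly value is multiplied by 12; that assumption is ours and is not stated in the data. The names people_sample and df_sample are made up for the illustration.

```python
import pandas as pd

people_sample = [
    {'name': 'John', 'salary': 10000},
    {'name': 'Stefan', 'monthly_salary': 1500},
]
df_sample = pd.DataFrame(people_sample)

# isna() marks the cells that pandas filled with NaN.
print(df_sample.isna())

# Assumption: 'salary' is annual, so the monthly figure is multiplied by 12.
df_sample['salary'] = df_sample['salary'].fillna(df_sample['monthly_salary'] * 12)
df_sample = df_sample.drop(['monthly_salary'], axis=1)
print(df_sample)
```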

# Loading data into pandas

Pandas has several ways to load (or create) dataframes. You can read files (e.g., read_csv, read_excel, read_json), or even connect to databases. Likewise, you can save data in all these formats. For our tutorial, we will use a dataset of tweets collected from company accounts across several countries. 

The dataset is actually much larger than this (6M tweets and counting), but here we will use a random sample of 10K tweets.

In [None]:
tweets = pd.read_csv('brand_tweets_bdaca.csv')

One of the easiest ways to see what the dataset contains is to simply call the dataframe.

In [None]:
tweets

The "Unnamed: 0" column does not seem to be that useful, so we can drop it. Note that axis=1 means that we are dropping a column rather than a row.

In [None]:
tweets = tweets.drop(['Unnamed: 0'], axis=1)
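
To make the role of the axis argument concrete, here is a small sketch on a made-up dataframe (toy is not part of the tweets dataset). It also shows that drop returns a new dataframe and leaves the original untouched, which is why the result is assigned back to tweets above.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

without_column_b = toy.drop(['b'], axis=1)  # axis=1: the labels refer to columns
without_row_0 = toy.drop([0], axis=0)       # axis=0: the labels refer to the row index

print(list(without_column_b.columns))
print(list(without_row_0.index))
print(toy.shape)  # the original dataframe is unchanged
```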

We can also check which columns the dataset contains.

In [None]:
tweets.columns

As a note, the columns in the dataset mean the following:
* utweet_id: Unique ID of the tweet
* company: Company name
* screenname: Twitter name of the company account
* country: Country in which the company account is (mostly) active
* idv, mas & uai: Hofstede's classification for Individualism, Masculinity & Uncertainty Avoidance of the given country
* revenues: Revenues of the company
* statuses_count: Total number of tweets that the company account has published
* followers_count: Total number of followers that the company account has
* created_at: Date and time in which the tweet was created
* animated_gif, video, photo: Number of animated gifs, videos or photos published in the tweet
* has_hash: Whether the tweet has a hashtag in the text
* is_retweet: Whether the tweet is actually a retweet from someone else's account
* is_reply: Whether the tweet is a reply to another user
* rts: Number of retweets that the tweet received

Note that the actual text of the tweets is not included (to reduce the file size).

## Saving the dataset

We made a small modification to the dataset, so we can save it. Pandas offers several formats to read/write data (see http://pandas.pydata.org/pandas-docs/stable/io.html). We'll use csv this time.

In [None]:
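# index=False keeps the row index out of the file, so no new unnamed column appears on reload
tweets.to_csv('brand_tweets_bdaca_corrected.csv', index=False)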
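
The sketch below illustrates, on a made-up two-row dataframe, how an "Unnamed: 0" column can arise: by default to_csv writes the row index as an extra first column without a header, and read_csv then loads it as a regular column. An in-memory text buffer stands in for a file here, so nothing is written to disk.

```python
import io
import pandas as pd

toy = pd.DataFrame({'company': ['A', 'B'], 'rts': [3, 7]})

# Default behaviour: the row index is written as an extra, unnamed first column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
# index=False leaves the row index out of the output.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```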

# Exploring the dataset

Pandas allows for a lot of exploratory analyses to be done directly with built-in functions.

For example, we can get descriptive statistics for the numerical variables:

In [None]:
tweets.describe()

Or get the frequencies of a categorical variable:

In [None]:
tweets['screenname'].value_counts()
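
If proportions are more useful than raw counts, value_counts also accepts normalize=True. A minimal sketch with made-up account names:

```python
import pandas as pd

accounts = pd.Series(['brand_a', 'brand_b', 'brand_a', 'brand_a'])

counts = accounts.value_counts()                # absolute frequencies
shares = accounts.value_counts(normalize=True)  # relative frequencies (sum to 1)

print(counts)
print(shares)
```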

You can also group by a given category and get the descriptives for each group.

In [None]:
tweets.groupby(['company']).describe()

Or just get the means of a specific column.

In [None]:
tweets[['rts', 'company']].groupby(['company']).mean()

To be clearer, in the previous step we actually selected two columns of the dataframe when we requested tweets[['rts', 'company']]. We can also create a smaller dataframe with just a few columns of the larger dataframe, and then perform operations on it later.

In [None]:
subdf1 = tweets[['screenname', 'company']]

In [None]:
subdf1
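
Column selection and grouping can also be combined with agg to compute several statistics at once. The sketch below uses a made-up dataframe with the same column names as the tweets dataset; the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'company': ['A', 'A', 'B', 'B'],
    'rts': [1, 3, 10, 20],
})

# Select one column per group, then compute several statistics in one call.
summary = toy.groupby('company')['rts'].agg(['mean', 'max', 'count'])
print(summary)
```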

We can also filter the dataframe. Say that we are only interested in the tweets from Germany.

In [None]:
tweets[tweets['country'] == 'Germany']
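
Filters can be combined. The sketch below uses a made-up dataframe and assumes that is_retweet is coded as 0/1, which may differ from the actual coding in the tweets dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['Germany', 'Germany', 'Brazil', 'Netherlands'],
    'is_retweet': [0, 1, 0, 0],  # assumed 0/1 coding
    'rts': [5, 2, 9, 1],
})

# Each condition needs its own parentheses because & binds tighter than ==.
german_originals = toy[(toy['country'] == 'Germany') & (toy['is_retweet'] == 0)]

# isin() keeps the rows whose value appears in the given list.
two_countries = toy[toy['country'].isin(['Brazil', 'Netherlands'])]

print(german_originals)
print(two_countries)
```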

## Visualizing the data

We can also run some quick data visualizations. For more information, see: http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html

In [None]:
tweets.plot.scatter(x='statuses_count', y='followers_count', c='blue')


In [None]:
tweets.hist(column='statuses_count', alpha=.5, bins=10)

For some of the visualizations, we can ask for several plots to be drawn at the same time. For example, we may want a histogram of statuses_count for each country.

In [None]:
tweets.hist(column='statuses_count', by=['country'], alpha=.5, bins=10, figsize=(10,7))

# Running statistical tests

While you could export the dataframe to CSV (or even to a STATA format) and do the statistical analysis elsewhere, Pandas & Statsmodels/Numpy allow you to do a lot of it in the same workflow. 

The examples below are just a very small set of what these packages can do. To know more, check:
* Pandas Computational Tools: http://pandas.pydata.org/pandas-docs/stable/computation.html
* Statsmodels documentation: http://statsmodels.sourceforge.net/


In [None]:
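# Pairwise correlations between the numerical columns.
# On pandas 2.0 or newer, text columns must be excluded explicitly: tweets.corr(numeric_only=True)
tweets.corr()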
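
To read the correlation matrix, it helps to see it on data where the answer is known. In this sketch (made-up columns), one column is an exact multiple of x and another is its negative, so the Pearson correlations should be 1 and -1.

```python
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3, 4]})
toy['double_x'] = toy['x'] * 2   # perfectly positively related to x
toy['minus_x'] = -toy['x']       # perfectly negatively related to x

# corr() returns a square matrix with one row and one column per variable.
corr_matrix = toy.corr()
print(corr_matrix)
```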

We can also do t-tests. The function returns the test statistic, the p-value, and the degrees of freedom. 

Here, notice that we are using the ability to filter a dataframe (e.g., *tweets[tweets['country'] == 'Brazil']*  creates a dataframe only with Brazil as a country; adding *['rts']* at the end selects only the column for rts).

In [None]:
from statsmodels.stats.weightstats import ttest_ind

In [None]:
ttest_ind(tweets[tweets['country'] == 'Brazil']['rts'],
          tweets[tweets['country'] == 'Netherlands']['rts'])
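
For intuition, the test statistic and the degrees of freedom can be computed by hand. The sketch below assumes the pooled-variance (equal variances) form of the two-sample t-test, which to our knowledge is what ttest_ind uses by default; the two groups are made-up numbers and the p-value is left out.

```python
import math
import pandas as pd

group_a = pd.Series([1, 2, 3, 4, 5])
group_b = pd.Series([2, 4, 6, 8, 10])

n_a, n_b = len(group_a), len(group_b)
degrees_of_freedom = n_a + n_b - 2

# Series.var() uses the sample variance (ddof=1) by default.
pooled_variance = ((n_a - 1) * group_a.var() + (n_b - 1) * group_b.var()) / degrees_of_freedom
standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
t_statistic = (group_a.mean() - group_b.mean()) / standard_error

print(t_statistic, degrees_of_freedom)
```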

We can also do an OLS regression. In order to do so, we need to define a model and then run it. When defining the model, you create the equation in the following manner:
* First you include your dependent variable, followed by the ~ sign
* Then you include the independent variables (separated by the + sign)
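
To illustrate what such a formula stands for, the sketch below fits the equivalent least-squares problem directly with numpy on made-up data. A formula like 'rts ~ photo + video' corresponds to a design matrix with a column of ones (the intercept, which the formula interface adds automatically) plus one column per independent variable. The outcome is constructed to follow rts = 1 + 2*photo + 3*video exactly, so the recovered coefficients are known in advance.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'photo': [0, 1, 0, 1, 2], 'video': [0, 0, 1, 1, 0]})
# Made-up outcome that follows rts = 1 + 2*photo + 3*video exactly.
toy['rts'] = 1 + 2 * toy['photo'] + 3 * toy['video']

# Intercept column of ones, then one column per independent variable.
design = np.column_stack([
    np.ones(len(toy)),
    toy['photo'].to_numpy(),
    toy['video'].to_numpy(),
])
outcome = toy['rts'].to_numpy(dtype=float)

# Ordinary least squares: coefficients ordered as intercept, photo, video.
coefficients, *_ = np.linalg.lstsq(design, outcome, rcond=None)
print(coefficients.round(6))
```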

In [None]:
from statsmodels.formula.api import ols

In [None]:
model = 'rts ~ photo + video + followers_count'

In [None]:
regression = ols(formula=model, data=tweets).fit()

In [None]:
print(regression.params)

In [None]:
print(regression.summary())

You can also run a series of models in a for loop, for example.

In [None]:
base_model = 'rts ~ followers_count'
independent_variables = ['statuses_count', 'uai', 'idv', 'mas', 'photo', 'video', 'animated_gif']

In [None]:
models = []
# ols was imported above with: from statsmodels.formula.api import ols
models.append(ols(formula=base_model, data=tweets).fit())

In [None]:
# Add one independent variable at a time and fit a new model after each addition
for iv in independent_variables:
    base_model += ' + ' + iv
    models.append(ols(formula=base_model, data=tweets).fit())


In [None]:
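# Use a separate loop variable so that the formula string stored in `model` is not overwritten
for fitted_model in models:
    print(fitted_model.summary())
    print('\n\n')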
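
The stepwise formulas themselves are plain strings, so it can help to build and inspect them before fitting anything. A minimal sketch, with a shortened and purely illustrative list of variables:

```python
base_formula = 'rts ~ followers_count'
extra_variables = ['photo', 'video']

# Start from the base formula and extend the previous formula by one variable at a time.
formulas = [base_formula]
for variable in extra_variables:
    formulas.append(formulas[-1] + ' + ' + variable)

for formula in formulas:
    print(formula)
```

Each string in formulas could then be passed as the formula argument of ols, in the same way as in the loop above.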

# More information

If you are interested in using Python for your statistical analyses, you may want to consult:
* 10-minute video showcasing the possibilities of Pandas: https://vimeo.com/59324550
* Doing time series analyses in Pandas: http://earthpy.org/pandas-basics.html & http://statsmodels.sourceforge.net/stable/vector_ar.html#var
* And doing a deep dive in the Pandas documentation :-) http://pandas.pydata.org/