<center><img src="http://alacip.org/wp-content/uploads/2014/03/logoEscalacip1.png" width="500"></center>


<center> <h1>Course: Introduction to Python</h1> </center>

<br>

* Instructor:  <a href="http://www.pucp.edu.pe/profesor/jose-manuel-magallanes/" target="_blank">Dr. José Manuel Magallanes, PhD</a> ([jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe))<br>
    - Professor at the **Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú**.<br>
    - Senior Data Scientist at the **eScience Institute** and Visiting Professor at the **Evans School of Public Policy and Governance, University of Washington**.<br>
    - Fellow Catalyst, **Berkeley Initiative for Transparency in Social Sciences, UC Berkeley**.
    
    
## Part 3: Loading data into Python

<a id='beginning'></a>
This session focuses on getting data. You may have to decide between collecting data from repositories or similar sources and collecting your own data to answer an ad-hoc research question. In the latter case, you will have to consider whether you need a probabilistic or a non-probabilistic design, a choice that also determines the next steps in your design.

In any case, you need to collect data that R or Python can read, unless your data is not suitable for any kind of computational processing; in this unit, I am assuming it is. If you have collected the data yourself, a popular choice for recording your observations is a spreadsheet, maybe in Excel or Google Docs. If you have obtained data from another party, you may also have spreadsheets, or more sophisticated files in particular formats, like SPSS or Stata. Maybe you decided to collect data from the web, and you are dealing with XML or JSON formats, or simply text without much structure. Let me show you how to deal with the following cases:

1. [Proprietary/common software.](#part1) 
2. [Collecting your own.](#part2) 
3. [Use of APIs.](#part3) 
4. [Scraping webpages.](#part4) 


Remember that the location of your files is extremely important. If you have created a folder named "my project", your code should be in that folder, which I sometimes call the root folder, and your data in another folder inside that root folder. In any case, you should become familiar with some important commands from the **os** package:

In [None]:
import os

The two most important uses are:

In [None]:
# where am I?
os.getcwd()

The command above gave you your current location. If it is not what you expected, you can change it with another command:

In [None]:
os.chdir()

You have to include the path to the folder you want between the parentheses; with nothing between them, as written above, the command fails. 

Be careful: follow the same pattern as the one returned by _os.getcwd()_; that is, check whether the folders in the path are separated by __\\__, __\\\__, __/__ or __//__. The separator depends on the type of computer you have. Remember that a path has to be written as a string, that is, between '' or "".
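
A minimal sketch of the two commands working together may help. It assumes a throwaway folder created with **tempfile** in place of your real root folder, and the names _startFolder_, _rootFolder_ and _movedOk_ are made up for this illustration:

```python
import os
import tempfile

# remember where we started so we can come back
startFolder = os.getcwd()

# a throwaway folder stands in for your project's root folder (assumption)
with tempfile.TemporaryDirectory() as rootFolder:
    os.chdir(rootFolder)   # the path is passed as a string
    # realpath resolves symbolic links, so both sides are comparable
    movedOk = os.path.realpath(os.getcwd()) == os.path.realpath(rootFolder)
    os.chdir(startFolder)  # go back before the throwaway folder is deleted

print(movedOk)
```

In your own project you would replace the throwaway folder with the path to your root folder, written with the same separators that _os.getcwd()_ shows.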

You need to change your root folder location only once, if at all; you do not call _os.chdir()_ again for every file you read. If the file is in a folder inside your root folder, you simply write: 

In [None]:
import os

folder="data"
fileName="anes_timeseries_2012.dta"
fileToRead=os.path.join(folder,fileName)

The object _fileToRead_ holds the right path, because **os.path.join** creates a path from the elements between the parentheses. Notice that if you are using Windows, a folder in the "C" hard drive should be written like this: 
os.path.join('c:/','folder1', 'folder2'). You can write several folders and path.join inserts the right separator between them; only on Windows do you need that ':/' element. If you want to know the separator your computer is using, type this:

In [None]:
os.path.sep
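
As a small check of what **os.path.join** does, the sketch below compares the joined path with one built by hand using the separator. The folder names 'folder1' and 'folder2' and the variable names _builtByHand_ and _deeper_ are only illustrative:

```python
import os

folder = "data"
fileName = "anes_timeseries_2012.dta"

# join inserts the separator that this computer uses
fileToRead = os.path.join(folder, fileName)

# the same path, built by hand with the separator
builtByHand = folder + os.path.sep + fileName

print(fileToRead)
print(fileToRead == builtByHand)

# several folders can be joined at once; 'folder1' and 'folder2' are made-up names
deeper = os.path.join("folder1", "folder2", fileName)
print(os.path.basename(deeper))  # the last element is the file name
```

Building paths this way keeps the same notebook usable on Windows, Mac and Linux without editing the separators.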

Let's turn our attention to the file acquisition process.


____


<a id='part1'></a>
## Collecting data from proprietary / common software

Let's start with data from Stata, which is very common in political science and public policy schools. To work with this kind of file, we will simply use *pandas*. 

In [None]:
import pandas as pd

I am using a file from the American National Election Studies (ANES). This is a rather big file, so let me select a couple of variables ("libcpre_self" and "libcpo_self", two questions asked before and after the election in which respondents place themselves on a seven-point scale ranging from ‘extremely liberal’ to ‘extremely conservative’) and create a data frame with them:

In [None]:
varsOfInterest=["libcpre_self","libcpo_self"]

Getting a Stata file into pandas is quite easy:

In [None]:
import os
folder="data"
fileName="anes_timeseries_2012.dta"
fileToRead=os.path.join(folder,fileName)
dataStata=pd.read_stata(fileToRead,columns=varsOfInterest)

In [None]:
dataStata.head()
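
If you do not have the ANES file at hand, the effect of the _columns_ argument can be tried on a toy file. This is only a sketch: the values and the column "other_var" are invented, and the file is written to a temporary folder with **to_stata** before being read back:

```python
import os
import tempfile
import pandas as pd

# made-up answers on the seven-point scale, plus one column we do not want
toy = pd.DataFrame({"libcpre_self": [1, 4, 7],
                    "libcpo_self": [2, 4, 6],
                    "other_var": [10, 20, 30]})

varsOfInterest = ["libcpre_self", "libcpo_self"]

with tempfile.TemporaryDirectory() as folder:
    fileToRead = os.path.join(folder, "toy.dta")
    toy.to_stata(fileToRead, write_index=False)  # write a small Stata file
    # read back only the columns of interest
    dataStata = pd.read_stata(fileToRead, columns=varsOfInterest)

print(list(dataStata.columns))
```

Selecting columns while reading is convenient for big files, because the unwanted variables never reach your data frame.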

Getting an Excel file is also straightforward:

In [None]:
fileName="ElectricBus.xlsx"
fileToRead=os.path.join(folder,fileName)
dataExcel=pd.read_excel(fileToRead,0) # 0 is the first sheet, which is the default, so it can be omitted
dataExcel.head()

CSV files are as easy:

In [None]:
fileName="mealSeattle.csv"
fileToRead=os.path.join(folder,fileName)
dataCSV=pd.read_csv(fileToRead)
dataCSV.head()

Of course, it is more fun to open several of these files at once.

In [None]:
from glob import glob

pattern='interv*.csv'
where='data'
# sorted() gives a stable order; glob itself does not guarantee one
fileNames = sorted(glob(os.path.join(where, pattern)))
for filename in fileNames:
    print (filename)
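
To see how the pattern selects files, here is a sketch that needs no data at all: it creates a few empty files in a temporary folder and then searches it. The file names are made up for the illustration:

```python
import os
import tempfile
from glob import glob

pattern = "interv*.csv"

with tempfile.TemporaryDirectory() as where:
    # create empty files; only three of them match the pattern
    for name in ["interv2.csv", "interv1.csv", "interv3.csv", "notes.txt"]:
        open(os.path.join(where, name), "w").close()

    # glob makes no promise about order, so sort the result
    found = sorted(glob(os.path.join(where, pattern)))
    foundNames = [os.path.basename(f) for f in found]

print(foundNames)
```

The asterisk stands for any run of characters, so "notes.txt" is left out while every file starting with "interv" and ending in ".csv" is kept.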

We had access to the names, so let's read each file into a list of data frames:

In [None]:
allFiles=[]
# sorted() keeps the files in name order, so the positions used below are predictable
for filename in sorted(glob(os.path.join(where, pattern))):
    allFiles.append(pd.read_csv(filename))

In [None]:
#do we have the data?
allFiles[0]

Let's concatenate the first four files:

In [None]:
pd.concat(allFiles[0:4],ignore_index=True)

In [None]:
#storing what we did:
newOneFile=pd.concat(allFiles[0:4],ignore_index=True)

Let's merge with the last file:

In [None]:
newOneFile.merge(allFiles[4],left_on='interview', right_on='interview') # left_on/right_on are optional when the only shared column is the key

In [None]:
# saving the result
newOneFile.merge(allFiles[4]).to_csv('data/newOneFile.csv')
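
The difference between concatenating and merging is easier to see on toy data. In this sketch only the key name 'interview' comes from the code above; the other columns and all the values are invented:

```python
import pandas as pd

# two made-up batches with the same columns
batch1 = pd.DataFrame({"interview": [1, 2], "age": [30, 41]})
batch2 = pd.DataFrame({"interview": [3, 4], "age": [25, 58]})

# a made-up file with a different column for the same interviews
extra = pd.DataFrame({"interview": [1, 2, 3, 4],
                      "vote": ["yes", "no", "no", "yes"]})

# concat stacks rows; ignore_index renumbers them 0..n-1
stacked = pd.concat([batch1, batch2], ignore_index=True)

# merge adds columns by matching the key
merged = stacked.merge(extra, on="interview")

print(stacked.shape, merged.shape)
```

Concatenating makes the data frame longer (more rows, same columns), while merging makes it wider (same rows, more columns).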

[Go to page beginning](#beginning)

_____

<a id='part2'></a>

## Collecting your ad-hoc data

Let me assume you are collecting some data using [this](https://goo.gl/forms/f4m4zv41xBh5osrw1) **Google Form**. The answers to your form are saved in a spreadsheet, which you should publish as a CSV file. 

Then, you can read it like this:

In [None]:
import pandas as pd
link='' # paste here the link to your published CSV file
myData = pd.read_csv(link)

# here it is:
myData

[Go to page beginning](#beginning)

-----

<a id='part3'></a>

## Collecting data from APIs

There are organizations, public and private, that have an open data policy that allows people to access their repositories dynamically. You can get that data in CSV format if available, but it usually comes in XML or JSON format, which are containers that store data in an *associative array* structure. Python's dictionaries are very useful in these situations, as they can keep that nested, NoSQL-like structure better than data frames. Let me get the data about 9-1-1 police responses from Seattle:

In [None]:
import requests

# where is it online?
url = "https://data.seattle.gov/resource/pu5n-trf4.json"

# Go for the data:
response = requests.get(url)

# If we got the data:
if response.status_code == 200:
    data911 = response.json()

In [None]:
len(data911)

In [None]:
# You can turn it easily into a pandas data frame:

data911DF=pd.DataFrame(data911)

In [None]:
# here you are...
data911DF.head()
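
What happens between the JSON text and the data frame can be sketched without going online. The response body below is invented (the field names "id", "type" and "zone" are not the ones in the Seattle data), and it shows that records do not need to share the same keys:

```python
import json
import pandas as pd

# a made-up response body; the field names are illustrative only
body = '[{"id": "a1", "type": "noise"}, {"id": "a2", "type": "theft", "zone": "N"}]'

records = json.loads(body)     # JSON text -> list of dictionaries
print(type(records), type(records[0]))

toyDF = pd.DataFrame(records)  # one row per dictionary
print(toyDF.shape)
print(toyDF["zone"].isna().tolist())
```

The first record has no "zone", so pandas fills that cell with a missing value; this is one way the looser dictionary structure gets forced into a rectangular table.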

### The case of Twitter

Social media offers rich textual information. In particular, **Twitter** offers an API (registration required) to get the data they have.

Follow these steps:
1. If you do not have a Twitter account, create one; use your Twitter username and password to access this [link](https://apps.twitter.com/).
2. When you are there, **Create a new App**. Just complete the basic info requested.
3. When the App is created, look for the _Keys and Access Tokens_.
4. Open a text editor (a simple one) and create a dictionary like this:  
{"consumer_key": "aaa", "access_token_secret": "bbb", "consumer_secret": "ccc", "access_token": "ddd"}
5. Save it as keysAPI.txt in the data folder inside the root folder for this course (the root folder is where your code is).

If you did everything right, the following code will work:


In [None]:
import json

# get the security info from file
with open('data/keysAPI.txt','r') as keysFile: # 'with' closes the file when done
    keysAPI = json.load(keysFile)
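
Before using the keys, it can be worth checking that the file has the four names expected. This is only a sketch: it writes a file with the placeholder values from the steps above into a temporary folder, and the names _fakeKeys_, _needed_ and _missing_ are made up:

```python
import json
import os
import tempfile

# placeholder values, never real credentials
fakeKeys = {"consumer_key": "aaa", "access_token_secret": "bbb",
            "consumer_secret": "ccc", "access_token": "ddd"}
needed = {"consumer_key", "consumer_secret", "access_token", "access_token_secret"}

with tempfile.TemporaryDirectory() as folder:
    keysFile = os.path.join(folder, "keysAPI.txt")
    with open(keysFile, "w") as f:
        json.dump(fakeKeys, f)   # what the text editor step produces
    with open(keysFile, "r") as f:
        keysAPI = json.load(f)   # what the notebook reads back

missing = needed - set(keysAPI)  # names that are absent from the file
print(sorted(missing))
```

An empty list means the file is complete. Keeping the keys in a separate file also means you can share your notebook without sharing your credentials.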

Check that you have **tweepy** installed; you may need to install it via **pip**.

In [None]:
import tweepy

# recovering security info
consumer_key = keysAPI['consumer_key']
consumer_secret = keysAPI['consumer_secret']
access_token = keysAPI['access_token']
access_token_secret = keysAPI['access_token_secret']

In [None]:
# using security info:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
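# wait_on_rate_limit_notify exists only in tweepy versions before 4.0; remove it if you get a TypeError
api=tweepy.API(auth, wait_on_rate_limit=True,wait_on_rate_limit_notify=True,parser=tweepy.parsers.JSONParser())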

In [None]:
# getting the tweets from a user:

tweets = api.user_timeline(screen_name = 'PepeMujicaDice', count = 100, include_rts = False)

Let's see what we have:

In [None]:
tweets

In [None]:
type(tweets)

In [None]:
type(tweets[0])

In [None]:
aTweet=tweets[0]

for field in aTweet.keys():
    print (field)

In [None]:
aTweet['text']

In [None]:
aTweet['created_at']

In [None]:
# transform the list of dicts into a DF in pandas
mujicaTweets=pd.DataFrame({'textTweet':[t['text'] for t in tweets]})
mujicaTweets.head()
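
The same idea extends to more than one field. The sketch below uses two invented tweets and assumes that 'created_at' arrives as text in the format shown; check the format of your own data before reusing the pattern:

```python
import pandas as pd

# two made-up tweets with only the fields used here
toyTweets = [{"text": "first", "created_at": "Wed Oct 10 20:19:24 +0000 2018"},
             {"text": "second", "created_at": "Thu Oct 11 08:00:00 +0000 2018"}]

toyDF = pd.DataFrame({"textTweet": [t["text"] for t in toyTweets],
                      "dateTweet": [t["created_at"] for t in toyTweets]})

# turn the date strings into real timestamps (the format is an assumption)
toyDF["dateTweet"] = pd.to_datetime(toyDF["dateTweet"],
                                    format="%a %b %d %H:%M:%S %z %Y")
print(toyDF["dateTweet"].dt.day.tolist())
```

Once the dates are timestamps instead of text, you can sort the tweets or group them by day or month.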

[Go to page beginning](#beginning)

_____

<a id='part4'></a>

## Collecting data by scraping

We are going to get the data from a table on this [wikipage](https://en.wikipedia.org/wiki/List_of_freedom_indices).

In [None]:
from requests import get
from bs4 import BeautifulSoup as BS

# Location 
wiki="https://en.wikipedia.org/wiki/" 
link = "List_of_freedom_indices" 

wikiLink=wiki+link
# avoid rejection from server
identification = {"User-Agent":"Mozilla/5.0"}
# contact server
wikiPage =get(wikiLink , headers=identification)
# BS gets wikipedia page as html
wikiSoup =BS(wikiPage.content ,"html.parser")
# BS extracts every table with that class (each one is still html) 
wikiTables=wikiSoup.find_all("table",{"class":"wikitable sortable"})

In [None]:
#How many are there?
len(wikiTables)

In [None]:
# So, I just pick the one I need:
wikiTable=wikiTables[0]

In [None]:
# what do you have:
wikiTable

The table is there, but in HTML format. In general, a table is composed of rows, so this command will get every row from the table:

In [None]:
allRows=wikiTable.find_all('tr') #'tr' stands for table row, it is a TAG in HTML.

In [None]:
# headersHtml is simply the first row of the table
headersHtml=allRows[0]

In [None]:
# and we have:
headersHtml # this is ONE element from allRows 

You just saw the headers, but they are still in HTML; let me use the tag 'th' (table header) to get those elements in _headersHtml_:

In [None]:
headersHtml.find_all('th')

You see a list of elements, each one enclosed in 'th' tags (written between < and >). Each element in the list holds the text of the header (and other elements). We just need the text, so:

In [None]:
# headersList is a list with each header's TEXT element:
headersList=[header.get_text() for header in headersHtml.find_all('th')]

In [None]:
# Here are the headers:
headersList

The same process should be followed to get the data.

In [None]:
# This should be the data:
rowsHtml=allRows[1:]  #... [1:] is omitting the Headers

In [None]:
# let's see one of these:
rowsHtml[0]

Each table cell uses the **td** tag. This time we will not recover ONE row but several, so we need to adapt the code used for the headers:

In [None]:
# some python beauty:
# using 'td' 
rowsList=[[cell.get_text() for cell in row.find_all('td')] for row in rowsHtml]


In [None]:
rowsList[0:3] # a list of lists

In [None]:
# Data frame creation
import pandas as pd

# making a data frame from list of lists!
pd.DataFrame(data=rowsList , columns=headersList)
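
The whole process can be rehearsed offline on a tiny table. The HTML below is invented and only mimics the tags of the real page; the sketch also uses _strip=True_, an option of **get_text** that removes the spaces and line breaks that often surround the text:

```python
from bs4 import BeautifulSoup as BS

# a tiny made-up table with the same tags as the one on the page
html = """
<table class="wikitable sortable">
<tr><th>Country</th><th>Score</th></tr>
<tr><td>A</td><td>1</td></tr>
<tr><td>B</td><td>2</td></tr>
</table>
"""

toyTable = BS(html, "html.parser").find("table")
toyRows = toyTable.find_all("tr")

# strip=True removes the spaces and line breaks around the text
toyHeaders = [h.get_text(strip=True) for h in toyRows[0].find_all("th")]
toyData = [[c.get_text(strip=True) for c in r.find_all("td")] for r in toyRows[1:]]

print(toyHeaders)
print(toyData)
```

These two lists can go into pd.DataFrame(data=toyData, columns=toyHeaders) exactly as above. Notice that every value is still text, so numeric columns need to be converted before any computation.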

_____

**SPONSORSHIP**: 

* The development of this content was made possible by a grant from the Berkeley Initiative for Transparency in the Social Sciences (BITSS) at the Center for Effective Global Action (CEGA) at the University of California, Berkeley.


<center>
<img src="https://www.bitss.org/wp-content/uploads/2015/07/bitss-55a55026v1_site_icon.png" style="width: 200px;"/>
</center>

* This course is sponsored by:


<center>
<img src="https://www.python.org/static/img/psf-logo@2x.png" style="width: 500px;"/>
</center>



**ACKNOWLEDGMENTS**


Dr. Magallanes thanks the Pontificia Universidad Católica del Perú for supporting his participation in the Escuela ALACIP.

<center>
<img src="https://dci.pucp.edu.pe/wp-content/uploads/2014/02/Logotipo_colores-290x145.jpg" style="width: 400px;"/>
</center>


The author acknowledges the support that the eScience Institute at the University of Washington has provided since 2015 for his research in Data Science.

<center>
<img src="https://escience.washington.edu/wp-content/uploads/2015/10/eScience_Logo_HR.png" style="width: 500px;"/>
</center>

<br>
<br>