<img src="https://i.imgur.com/6U6q5jQ.png"/>


# Data Collection in Python

<a id='beginning'></a>
This session focuses on getting data. Here you face a choice: collect data from repositories or similar sources, or collect your own data to answer an ad-hoc research question. In the latter case you must consider whether you need a probabilistic or a non-probabilistic design, which will also determine the next steps in your design.
In any case, you need data that R or Python can read, unless your data is not suitable for any kind of computational processing. In this unit, I am assuming it is. If you have collected the data yourself, a popular choice for recording your observations is a spreadsheet, maybe using Excel or Google Docs. If you got the data from another party, you may also have spreadsheets, or more sophisticated files in particular formats, like SPSS or Stata. Maybe you decided to collect data from the web, and you are dealing with XML or JSON formats, or simply text without much structure. Let me show you how to deal with the following cases:

1. [Proprietary software.](#part1) 
2. [Ad-hoc collection.](#part2) 
3. [Use of APIs.](#part3) 
4. [Scraping webpages.](#part4) 


Remember that the location of your files is extremely important. If you have created a folder named "my project", your code should be in that folder, which I sometimes call the root folder, and your data in another folder inside that root folder. In any case, you should become familiar with some important commands from the **os** package:

In [1]:
import os

The two most important uses are finding out where you are and building the path to a file:

In [2]:
# where am I?
os.getcwd()

'C:\\Users\\DELL\\Documents\\GitHub\\Data-Collection'

If the file is in a folder inside your root folder, you simply write: 

In [3]:
import os

folder="data"
fileName="anes_timeseries_2012.dta"
fileToRead=os.path.join(folder,fileName)

The object _fileToRead_ holds the correct path, because **os.path.join** builds a path from the elements between the parentheses. Notice that on Windows, a folder on the "C" drive should be written like this: 
os.path.join('c:/','folder1', 'folder2'). You can pass several folders and path.join inserts the right separator between them; only on Windows do you need that ':/' element for the drive. If you want to know the separator your computer is using, type this:

In [4]:
os.path.sep

'\\'
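
Before reading a file, it can help to confirm that the path you built points where you think it does. Here is a minimal sketch of that check; the file name `example.csv` is made up for illustration, so the last line will most likely print `False` on your machine:

```python
import os

# Hypothetical folder and file names, used only for illustration.
folder = "data"
fileName = "example.csv"

# Build the relative path with the separator of the current system.
fileToRead = os.path.join(folder, fileName)

# Turn it into an absolute path, anchored at the working directory.
fullPath = os.path.abspath(fileToRead)

# Check that the file is really there before trying to read it.
fileExists = os.path.isfile(fileToRead)

print(fileToRead)
print(fullPath)
print(fileExists)
```

If the check returns `False`, compare the absolute path with the result of `os.getcwd()`: the usual cause is that the notebook is not running from the root folder.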

Let's turn our attention to the file acquisition process.


____


<a id='part1'></a>
## Collecting data from proprietary software

Let's start with data from SPSS and Stata, very common in public policy schools. To work with these kinds of files, we will simply use *pandas*. 



In [5]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [1]:
pip install pyarrow


In [5]:
import pandas as pd

I am using a file from the American National Election Studies (ANES). This is a rather big file, so let me select two variables, "libcpre_self" and "libcpo_self" (a pair of questions, asked before and after the election, in which respondents place themselves on a seven-point scale ranging from ‘extremely liberal’ to ‘extremely conservative’), and create a data frame with them:

In [6]:
varsOfInterest=["libcpre_self","libcpo_self"]

Getting a Stata file into pandas is quite easy:

In [8]:
import os
folder="data"
fileName="anes_timeseries_2012.dta"
fileToRead=os.path.join(folder,fileName)
dataStata=pd.read_stata(fileToRead,columns=varsOfInterest)

In [9]:
dataStata.head()

| | libcpre_self | libcpo_self |
|---|---|---|
| 0 | 1. Extremely liberal | -6. Not asked, unit nonresponse (no post-elect... |
| 1 | 1. Extremely liberal | 2. Liberal |
| 2 | -2. Haven't thought much about this | 2. Liberal |
| 3 | -2. Haven't thought much about this | -8. Don't know |
| 4 | 2. Liberal | 2. Liberal |
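
Each label in that output starts with a numeric code, and the negative codes mark answers such as "Don't know". A possible next step is to keep only the code and treat the negative ones as missing. The sketch below does that on a small stand-in data frame built from the labels shown above. Treating every negative code as missing is my assumption, so check the ANES codebook before applying it to the real file; the helper `label_to_code` is hypothetical:

```python
import pandas as pd

# A tiny stand-in for the ANES columns, reusing labels from the output above.
sample = pd.DataFrame({
    "libcpre_self": ["1. Extremely liberal",
                     "-2. Haven't thought much about this",
                     "2. Liberal"],
    "libcpo_self": ["2. Liberal", "-8. Don't know", "2. Liberal"],
})

def label_to_code(column):
    """Extract the leading integer of each label; negative codes become missing."""
    text = column.astype(str)  # read_stata may return categorical columns
    codes = text.str.extract(r"^(-?\d+)\.", expand=False).astype(float)
    return codes.where(codes > 0)

# Apply the helper to every column of the data frame.
codes = sample.apply(label_to_code)
print(codes)
```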


Opening SPSS files in pandas requires that you first install pyreadstat:

In [10]:
# do you have it?
!pip show pyreadstat



In [11]:
# Set up the file location:
fileName="anes_timeseries_2012.sav"
fileToRead=os.path.join(folder,fileName)

# Open it: 
dataSpss=pd.read_spss(fileToRead) 

ImportError: Missing optional dependency 'pyreadstat'.  Use pip or conda to install pyreadstat.

In [None]:
dataSpss.head()
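
An error like the one above appears when an optional reader is not installed. One way to find out in advance, without leaving Python, is to ask whether the package can be located. This is only a sketch: the helper name `is_installed` is mine, and the list of packages is simply the optional readers mentioned in this unit:

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate the package, without importing it."""
    return importlib.util.find_spec(package_name) is not None

# Optional readers mentioned in this unit.
for package_name in ["pyreadstat", "openpyxl", "lxml"]:
    status = "available" if is_installed(package_name) else "missing"
    print(package_name, status)
```

Any package reported as missing can be installed with pip, and you may need to restart the kernel afterwards.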

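Excel files are read with pd.read_excel, which needs the openpyxl package for .xlsx files. Check whether you have it:

In [None]:
pip show openpyxl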

In [None]:
# Set up the file location:
fileName="HDI.xlsx"
fileToRead=os.path.join(folder,fileName)

# Open it: 
dataExcel=pd.read_excel(fileToRead) 

In [None]:
dataExcel.info()

[Go to page beginning](#beginning)

_____

<a id='part2'></a>

## Collecting your ad-hoc data

Let me assume you have collected some data using Google Forms. The answers to your form are saved in a spreadsheet, which you should publish as a CSV file. Then, I can read it like this:

In [None]:
import pandas as pd
link='https://docs.google.com/spreadsheets/d/e/2PACX-1vRCHCDPx4NmYA5phchO2rZhZSPvHZjkF08E11i3gsjHCy4zVWc12IRGg8rMzDgpvIHCZQqGeqPFhWa6/pub?gid=692075096&single=true&output=csv'
fromGoogle = pd.read_csv(link)

# here it is:
fromGoogle

In [None]:
fromGoogle.info()
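
Form answers usually arrive with a timestamp column and long question texts as column names, so a little tidying at reading time pays off. The sketch below uses made-up answers typed directly into the code instead of a link, and the column names are hypothetical, not the ones in the form above:

```python
import io
import pandas as pd

# Made-up answers standing in for a published form; column names are hypothetical.
csvText = """Timestamp,Age,Favorite language
1/15/2024 10:30:00,25,Python
1/15/2024 11:05:12,31,R
1/16/2024 09:45:40,,Python
"""

# read_csv accepts a file-like object as well as a link.
answers = pd.read_csv(io.StringIO(csvText), parse_dates=["Timestamp"])

# Shorter column names are easier to work with later.
answers = answers.rename(columns={"Favorite language": "language", "Age": "age"})

print(answers.dtypes)
print(answers["language"].value_counts())
```

Notice that the empty answer in the third row becomes a missing value, which is why the age column is read as a float.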

[Go to page beginning](#beginning)

-----

<a id='part3'></a>

## Collecting data from APIs

There are organizations, public and private, with an open data policy that lets people access their repositories dynamically. You can get that data in CSV format if available, but it usually comes in XML or JSON format, which are containers that store data in an *associative array* structure. Python's dictionaries are very useful in these situations, as they keep that NoSQL-like structure better than data frames do. Let me get the data about 9-1-1 police responses from Seattle:

In [None]:
# make sure to install these packages before running:
# pip install pandas
# pip install sodapy

import pandas as pd
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.seattle.gov", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.seattle.gov",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("kzjm-xkqj", limit=2000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

In [None]:
results_df.shape

In [None]:
results_df
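
What an API returns as JSON becomes, in Python, a list of dictionaries, and some of those dictionaries may contain other dictionaries. The sketch below shows that mapping with the standard library only. The response text and its field names are made up and are not the schema of the Seattle data set; the helper `flatten` is hypothetical:

```python
import json

# Made-up response text; the field names are hypothetical, not Seattle's schema.
responseText = """
[
  {"id": "A1", "type": "TRAFFIC",
   "location": {"latitude": "47.61", "longitude": "-122.33"}},
  {"id": "A2", "type": "NOISE"}
]
"""

# json.loads turns JSON arrays into lists and JSON objects into dictionaries.
records = json.loads(responseText)

def flatten(record, parent=""):
    """Flatten nested dictionaries into one level, joining keys with a dot."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

flatRecords = [flatten(record) for record in records]
print(flatRecords[0])
print(flatRecords[1])
```

The second record has no location, so it simply has fewer keys. A data frame built from such records would show missing values in those columns.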

[Go to page beginning](#beginning)

_____

<a id='part4'></a>

## Collecting data by scraping

We are going to get the data from a table on this [Wikipedia page](https://en.wikipedia.org/wiki/List_of_freedom_indices):

In [None]:
pip show beautifulsoup4 html5lib lxml

In [None]:

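# Location 
wikilink = "https://en.wikipedia.org/wiki/List_of_freedom_indices" 

# every table found on the page
wikiTables1=pd.read_html(wikilink)
# only tables with the class attribute 'wikitable sortable'
wikiTables2=pd.read_html(wikilink,attrs={'class':'wikitable sortable'})
# only tables containing text that matches "Score"
wikiTables3=pd.read_html(wikilink,match="Score")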

In [None]:
# How many tables did each call find?
len(wikiTables1),len(wikiTables2),len(wikiTables3)

In [None]:
wikiTables2[0]

In [None]:
wikiTables3[0]

In [None]:
wikiTables2[0].info()
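
To get a feel for what happens when a table is pulled out of a page, here is a rough sketch that uses only the standard library's html.parser. It is not how pd.read_html is implemented, which does much more (headers, merged cells, data types). The page fragment and its values are made up, and the class `TableCells` is hypothetical:

```python
from html.parser import HTMLParser

class TableCells(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.inCell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # a new row starts
        elif tag in ("td", "th"):
            if not self.rows:             # cell found before any row tag
                self.rows.append([])
            self.inCell = True
            self.rows[-1].append("")      # a new, still empty, cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.inCell = False

    def handle_data(self, data):
        if self.inCell:
            self.rows[-1][-1] += data.strip()

# Made-up page fragment; the values are not taken from Wikipedia.
page = """
<table class="wikitable sortable">
  <tr><th>Country</th><th>Score</th></tr>
  <tr><td>Aland</td><td>90</td></tr>
  <tr><td>Borduria</td><td>35</td></tr>
</table>
"""

parser = TableCells()
parser.feed(page)
print(parser.rows)
```

Every value comes out as text, which is why checking the data types with info() after scraping is a good habit.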

In [None]:
# same request, but parsed with BeautifulSoup (needs beautifulsoup4 and html5lib)
wikiTables2_bs=pd.read_html(wikilink,flavor='bs4',
                            attrs={'class':'wikitable sortable'})

In [None]:
wikiTables2_bs[0]

In [None]:
wikiTables2_bs[0].info()

In [None]:
# keep an independent copy of the table to work on
freedomDF=wikiTables2_bs[0].copy()