### GESIS Fall Seminar in Computational Social Science 2022
### Introduction to Computational Social Science with Python
# Day 3-1: Handling Social Data

## Overview

* Ethics of data access
* Reading and writing common file types
* More complex data types: time, dates, Unicode, etc.


## Ethics of data access
* Ethics vs the law
* T&Cs and robots.txt
* Database dumps, APIs, Scraping

### Ethics vs the law
* This does not constitute legal advice; you are responsible for your own computer use.
* The international and often outdated nature of Internet law makes this murky territory.
* It's best to provide identifying information so you can be contacted in case of issues.
* Request data at a reasonable rate, and take only what you need.
* Automated data collection may be easy, but remain respectful of public/private data, copyright, and institution ethics procedures.
* Sometimes ethics and the site terms and conditions can be at odds with each other.
* Terms and conditions might be scary, but are often [not legally enforceable](https://www.eff.org/deeplinks/2022/04/scraping-public-websites-still-isnt-crime-court-appeals-declares). You may still be blocked or restricted from site access, though.
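
The advice above on identifying yourself and requesting data at a reasonable rate can be sketched in code. This is a minimal illustration, not a complete scraper: the `USER_AGENT` string, the contact address, and the helper names `build_request` and `fetch_all` are all made up for this example, and the `fetch` argument stands in for whatever function actually downloads a page (for instance `urllib.request.urlopen`).

```python
import time
import urllib.request

# Hypothetical identifying string: replace with your own project and contact details
USER_AGENT = "css-course-demo/0.1 (contact: your.name@example.org)"


def build_request(url):
    """Attach identifying information to a request so the site owner can contact you."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch() on each URL in turn, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(build_request(url)))
    return results


# Demonstration without touching the network: 'fetch' just reports the URL it was given
urls = ["https://example.org/a", "https://example.org/b"]
print(fetch_all(urls, fetch=lambda request: request.full_url, delay_seconds=0))
```

Passing the download function in as an argument keeps the pausing logic separate from the network code, which also makes it easy to try out without sending any real requests.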

### T&Cs and robots.txt
* Policies for scraping and API use can typically be found in the website Terms & Conditions and the robots.txt file.
* Be more respectful of smaller websites (maybe get in contact).
* Be more wary of larger company websites (especially with large projects).
     - That said, there are usually lots of GitHub repos and blogs from people who have tried scraping these platforms. Learn from their experience.
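
Python's standard library can interpret a robots.txt file for you. The sketch below parses a made-up robots.txt (the rules, the bot name `my-research-bot`, and the example.org URLs are all invented for illustration) and asks whether two pages may be fetched. For a real site you would typically point the parser at the live file with `set_url()` and `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether our (hypothetical) bot may fetch two different pages
allowed_public = parser.can_fetch("my-research-bot", "https://example.org/public/page.html")
allowed_private = parser.can_fetch("my-research-bot", "https://example.org/private/page.html")

# The requested pause between requests, if the file specifies one
delay = parser.crawl_delay("my-research-bot")

print(allowed_public, allowed_private, delay)
```

Note that robots.txt only covers automated access rules; the Terms & Conditions still need to be read by a human.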


### Database dumps, APIs, Scraping
* Sometimes, websites make full/partial dumps of their data available. These are a useful first step to explore data.
* As a next step, consider API usage. This is an official channel which can be controlled and monitored by the platform.
* If neither of these is available or you need more data, turn to scraping (effectively simulating a user browsing the site).

## 🏋️‍♀️ PRACTICE

Q1: Find the policies on scraping data and API usage for a popular online platform. Write a short paragraph on what you find, considering the following questions:
* What data collection is permitted? Under what conditions? How frequently?
* How would the data acquired compare to what a typical user sees?
* How would the data acquired compare to what the platform has access to?
* Why are there discrepancies between the available data and what the user/platform can access?
* Any other surprises?



## Reading and writing common file types
* In order to read/write files we need to tell Python:
    - Where the file is on our computer.
    - How to read/write it.
    - To open the file.
    - What to read/write.
    - To close the file.
* The most common file types you will read/write are .txt, .csv, .json, and .pickle (and variants thereof).
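
The first step, telling Python where the file is, can be sketched with `os.path`. The path `data/demo.txt` is only used as an example here; the snippet builds and inspects the path without requiring the file to exist.

```python
import os

# A relative path is interpreted relative to the current working directory
relative_path = os.path.join('data', 'demo.txt')

# The corresponding full (absolute) path
full_path = os.path.abspath(relative_path)

print(relative_path)
print(full_path)

# Checking that a file exists before opening it avoids a FileNotFoundError
print(os.path.exists(full_path))
```

Using `os.path.join` rather than typing the separator by hand keeps the path valid on both Windows and Unix-like systems.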

In [None]:
# Use the os module to get the current working directory
import os

print(os.getcwd())
# We can use 'relative' file paths from here to read files (see below)

# If we want to read a file from outside this folder, we need to specify the full filepath

### How to read/write a file
We can open text files with several modes:
* "r" – read a file
* "r+" – read and write to a file
* "w" – write to a file (creates new file / overwrites existing content)
* "w+" – read and write to a file (creates new file / overwrites existing content)
* "a" – appending to an already existing file
* "a+" – append to a file after reading

![fopen](figs/fopen.png "fopen")
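
The difference between the `"w"` and `"a"` modes is easiest to see by trying them. The following sketch works in a temporary folder so that no real files are touched; the file name `modes_demo.txt` is made up for the example.

```python
import os
import tempfile

# Work in a temporary folder that is deleted automatically afterwards
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'modes_demo.txt')  # hypothetical file name

    with open(path, 'w') as f:   # 'w' creates the file
        f.write('first\n')

    with open(path, 'a') as f:   # 'a' keeps existing content and adds to the end
        f.write('second\n')

    with open(path, 'r') as f:
        after_append = f.read()

    with open(path, 'w') as f:   # 'w' on an existing file overwrites everything
        f.write('replaced\n')

    with open(path, 'r') as f:
        after_overwrite = f.read()

print(repr(after_append))
print(repr(after_overwrite))
```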

In [None]:
# File open/read/close syntax

# Read full file as a string
f = open('data/demo.txt', 'r')
alltxt = f.read()
f.close()

# Read first line(s) as a string
f = open('data/demo.txt', 'r')
firstline = f.readline()
secondline = f.readline()
f.close()

f = open('data/demo.txt', 'r')
alllines = f.readlines()
f.close()

print(alltxt)
print()
print(firstline)
print(secondline)
print()
print(alllines)


In [None]:
# It can be easy to forget the f.close() command, so more common syntax is:

# Reading
with open('data/demo.txt','r') as f:
    alllines = f.readlines()

# This opens, reads, then automatically closes the file

print(alllines)

In [None]:
linestowrite = ['first line\n', 'second line\n', '\n', 'My name is: \n']

# Writing
with open('myfirstfile.txt','w') as f:
    f.writelines(linestowrite)   

In [None]:
texttoappend = 'Patrick Gildersleve\n@pgildersleve'

# Appending
with open('myfirstfile.txt','a') as f:
    f.write(texttoappend)

Many file types are stored as plain text and can be read/written with the same basic syntax. In practice there are often dedicated modules for more complex filetypes, but for demonstration purposes:

In [None]:
# Reading a .ipynb file
with open('3-2-scraping-data.ipynb','r') as f:
    ipynb = f.read()

print(ipynb[:500])

In [None]:
# Reading a .html file

with open('data/PL_table.html','r') as f:
    html = f.read()

print(html[:1000])

### CSV
* Comma-Separated Values - A text file format where fields are separated by commas and rows by newlines.
* Frequently used to store tabular data.
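
One caveat to "fields are separated by commas": a field may itself contain a comma if it is wrapped in quotes. A minimal sketch with made-up data (the tournament name and location are invented) shows why splitting on every comma can go wrong, while the standard `csv` module handles the quoting:

```python
import csv
import io

# Made-up data: the second field of the data row contains a comma inside quotes
raw = 'tournament,location\nExample Open,"Springfield, USA"\n'

# Naive approach: split each line at every comma
naive = [line.split(',') for line in raw.strip().split('\n')]

# csv module: respects the quotes around the field
parsed = list(csv.reader(io.StringIO(raw)))

print(naive[1])   # the location is wrongly split into two pieces
print(parsed[1])  # the location stays in one piece
```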

In [None]:
# Reading a .csv file

with open('data/WTA_2016.csv','r') as f:
    csvdata = f.read()

print(csvdata[:1000])

In [None]:
# Manually parse the csv data

# split each line
csvlines = csvdata.split('\n')
for line in csvlines[:5]:
    print(line)

# split at each comma
csvtable = [x.split(',') for x in csvlines]
for r in csvtable[:5]:
    print(r)

In [None]:
# Alternatively, the csv module (and functions built into many other packages) handles all the reading and parsing:

import csv

with open('data/WTA_2016.csv', 'r', newline='') as f:  # newline='' is recommended by the csv docs
    reader = csv.reader(f)
    csvtable = list(reader)  # each row is already a list of strings
 
for r in csvtable[:5]:
    print(r)

### JSON
* "JavaScript Object Notation" - an open file format used to store structured attribute/value pairs and arrays in a readable text format.
* For Python, it is useful for storing structured data like dictionaries and lists.
* Frequently encountered when using web APIs.
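
JSON supports fewer types than Python, so a round trip through JSON does not always return exactly what went in. The sketch below (with a made-up dictionary) illustrates two common surprises: dictionary keys always come back as strings, and tuples come back as lists.

```python
import json

# Made-up example with an integer key and a tuple value
original = {1: (1, 2), 'name': None, 'ok': True}

text = json.dumps(original)   # Python object -> JSON string
restored = json.loads(text)   # JSON string -> Python object

print(text)
print(restored)
print(original == restored)   # False: the key 1 became '1' and the tuple became a list
```

If integer keys matter, they need to be converted back after loading, for example with `{int(k): v for k, v in loaded.items()}` on a dictionary whose keys are all numeric strings.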

In [None]:
import json

fruit = {'banana':'yellow', 'orange':'orange', 'apple':['red', 'green']}

with open('fruit.json', 'w') as f:
    json.dump(fruit, f) # json.dump to write data


In [None]:
with open('fruit.json', 'r') as f:
    readfruit = json.load(f)  # json.load to read data
    
print(readfruit)
print(type(readfruit)) # automatically reads as a dict

### Binary files
We can also read/write binary encoded files, such as 'pickle' files. These are useful for storing Python objects. In this case, Python does not try to convert the bytes it reads to string characters (because they're not string characters), and lets the module do the decoding.

We can open binary files with several modes:
* "rb" – read a file
* "rb+" – read and write to a file
* "wb" – write to a file (creates new file / overwrites existing content)
* "wb+" – read and write to a file (creates new file / overwrites existing content)
* "ab" – appending to an already existing file
* "ab+" – append to a file after reading
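
To see what the `b` in these modes changes, the sketch below writes a short string to a temporary file and reads it back in both text and binary mode. The file name is made up, and `'\u00df'` is simply the character 'ß' written as an escape.

```python
import os
import tempfile

word = 'stra\u00dfe'  # 'straße': six characters

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'bytes_demo.txt')  # hypothetical file name

    with open(path, 'w', encoding='utf-8') as f:
        f.write(word)

    with open(path, 'r', encoding='utf-8') as f:   # text mode: bytes are decoded to str
        as_text = f.read()

    with open(path, 'rb') as f:                    # binary mode: raw bytes, no decoding
        as_bytes = f.read()

print(type(as_text), len(as_text))
print(type(as_bytes), len(as_bytes))  # one more than the character count: 'ß' takes two bytes in UTF-8
```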

In [None]:
import pickle

fruit = {'banana':'yellow', 'orange':'orange', 'apple':['red', 'green']}

# Write the file using pickle 
with open('fruit.pickle', 'wb') as f:
    pickle.dump(fruit, f)

In [None]:
# Read the file using pickle 
with open('fruit.pickle', 'rb') as f:
    readfruit = pickle.load(f)
    
print(readfruit)
print(type(readfruit))

[`glob`](https://docs.python.org/3/library/glob.html) is a useful module for bulk selecting files to be read.

In [None]:
import glob

# '*' acts as a wildcard so you can find all files that match a pattern
filelist = glob.glob('data/WTA_*.csv')
print(filelist)

In [None]:
# Read the files in the filelist

alltext = ''
for filename in sorted(filelist):
    with open(filename, 'r') as f:
        header = f.readline() # reads the header
        text = f.read() # starts reading after the header
        alltext += text
        
print(alltext[:1000])

## 🏋️‍♀️ PRACTICE

In [None]:
# Q2: Create a dictionary with keys 0-10, each element should be a list of the keys raised to powers 0-10
# e.g. 2:[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
# Save this dictionary to a pickle


In [None]:
# Q3: Read the pickle file you just created. Add items (keys + power list) for the keys 11-20 to the dict.
# Save the result as a json file and use your file browser to check you can view it in a text editor.


In [None]:
# Q4: Write a program which takes a student register and saves it as register.txt
# The first line should just be "Name", and written immediately upon file creation.
# Subsequent lines should take user input, and append the inputted name to the file.
# Remember to add a mechanism to stop inputting names
# The program should make use of both the 'w+' file mode (when creating the file),
# and the 'a' file mode (when appending names).


## More complex data types: time, dates, Unicode, etc.

We have already seen basic data types in Python such as `int`, `float`, and `str`. Several other modules and their associated data types are frequently used in Computational Social Science.

### Time
* The 'Epoch', when computer time starts, is on January 1st, 1970, 00:00 UTC.
* Computers measure time by the number of seconds since this date.
* We can use the `time` module to measure the time and pause code.
* Not covered here, but time and date can get complex very quickly: timezones, leap years, leap seconds, ...
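
As a quick check of the epoch described above, a timestamp of 0 seconds can be converted into a readable date. This sketch uses the `datetime` module with an explicit UTC timezone so the result does not depend on the local timezone of your computer.

```python
import datetime
import time

# Timestamp 0 is the epoch itself
epoch = datetime.datetime.fromtimestamp(0, tz=datetime.timezone.utc)
print(epoch)  # 1970-01-01 00:00:00+00:00

# The current timestamp, converted the same way
now_seconds = time.time()
now_utc = datetime.datetime.fromtimestamp(now_seconds, tz=datetime.timezone.utc)
print(now_utc)

# The difference between the two matches the raw number of seconds
elapsed = now_utc - epoch
print(elapsed.total_seconds(), now_seconds)
```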

In [None]:
import time

timenow = time.time()
print('It has been %f seconds since January 1st, 1970, 00:00 UTC' %timenow)
print('It has been %f years since January 1st, 1970, 00:00 UTC' %(timenow/(60*60*24*365)))

print('The time is: ' + time.ctime())

In [None]:
# Use time.sleep(x) to pause for x seconds (useful when rate limiting when scraping)
for i in range(5,0,-1):
    print(i)
    time.sleep(1)
print('Liftoff!')

In [None]:
# %timeit is an IPython magic function that allows you to time lines of code
# Use it to help test your code and make it faster!

letters = ['a', 'b', 'c', 'd']
%timeit ' '.join(letters)
%timeit '%s %s %s %s' %(letters[0], letters[1], letters[2], letters[3])

### Datetime
* `datetime` is more capable and user-friendly than `time` when handling dates.
* 4 important classes:
   - `datetime` – Handles times and dates together (year, month, day, hour, minute, second, microsecond).
   - `date` – Handles dates independent of time (year, month, day).
   - `time` – Handles time independent of date (hour, minute, second, microsecond).
   - `timedelta` – Durations of and differences between datetimes.

In [None]:
import datetime

# We can create datetime objects
date0 = datetime.datetime.now()
print(date0)

date1 = datetime.datetime(2022, 4, 1, 12, 32, 54)
print(date1)

date2 = datetime.date(2016, 6, 26)
print(date2)

date3 = datetime.time(2, 24, 13)
print(date3)

date4 = datetime.timedelta(days=50, minutes=50)
print(date4)

In [None]:
# We can extract values from datetime objects
print(date0.year)
print(date1.month)
print(date2.day)
print(date3.hour)

print(date4.total_seconds()) # timedelta behaves slightly differently

In [None]:
# We can also perform basic date arithmetic

# Get timedelta between datetimes
print(date0 - date1)

# Add/subtract timedelta to/from datetime
print(date0 + date4)


In [None]:
# Custom extraction/printing of datetimes is done with strptime / strftime functions:
# https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
# Datetimes can be read/printed with specific format codes

xmas_string = '25 December 2022'
xmas_datetime = datetime.datetime.strptime(xmas_string, '%d %B %Y')
print(xmas_datetime)

date0 = datetime.datetime.now()
date0_string = datetime.datetime.strftime(date0, '%d/%m/%Y')
print(date0_string)


### Unicode
* Computers need to handle a wide array of characters: Latin alphabet, punctuation, accents, Chinese characters, emoji...
* Historically, not all of these have been supported. There are even competing/conflicting standards.
* An 'encoding' determines how a computer translates the bytes it reads into the character you see.
* "**Unicode**" is a standard format for consistent encoding, representation, and handling of text.
    - Several different Unicode encodings exist.
* Default encoding in Python, and many other places, is **UTF-8**: "Unicode Transformation Format – 8-bit"
    - Other unicode and non-unicode encodings exist.
* Any `str` in Python 3 is a sequence of Unicode characters; when text is written to or read from a file, it is almost always encoded as UTF-8.
* Sometimes (older) files, particularly across different languages, are encoded differently.
    - There is no foolproof way of detecting encoding!
    - Sometimes you have to rely on domain knowledge, and manual inspection to decode files.
    - Pray that you don't have to worry about this.

In [None]:
# We can convert the character strings we see to "bytestrings" for a particular encoding

s = 'Gürzenichstraße'

# The output is an ASCII representation (i.e. simple characters only) of the bytestring
print(s.encode('utf-8'))
print(s.encode('utf-16'))

# Encoding in one encoding, then decoding in another encoding often raises an error

try:
    print(s.encode('utf-16').decode('utf-8'))
except Exception as ex:
    print(ex)

# Though not always...

try:
    print(s.encode('utf-8').decode('utf-16'))
except Exception as ex:
    print(ex)
    
# This is why detecting encoding can be difficult!

In [None]:
# Let's try to read a file 'encodings.txt'

# with open('data/encodings.txt', 'r') as f:
#     s = f.read()
    
# This fails, as the file is not encoded in utf-8 (or rather, utf-8 cannot decode it)

In [None]:
# I know that the file was encoded with the Chinese 'gb18030' format
# So we can specify that when reading the file

with open('data/encodings.txt', 'r', encoding='gb18030') as f:
    s = f.read()

print(s)

In [None]:
# If we don't know the encoding, we can try using the chardet package to guess it
import chardet

with open('data/encodings.txt', 'rb') as f: # read *binary*
    s = f.read()
    
print(chardet.detect(s))
print(s.decode(chardet.detect(s)['encoding']))
# Well, it decoded the file, but not as we hoped...

In [None]:
# Within an encoding, there are sometimes different ways of representing a character

import unicodedata

s1 = 'Gürzenichstraße'
s2 = 'Gürzenichstraße'

print(s1, s2, s1==s2)
print()
# Visually, these strings look the same, but they are not returning True when compared
# They have the same unicode encoding, so what is happening?

# Let's look at the bytestrings
print(s1.encode())
print(s2.encode())
print()

# The accented character is represented in two different ways (effectively 2 different characters)
# We need to normalize the strings

n1 = unicodedata.normalize('NFD', s1)
n2 = unicodedata.normalize('NFD', s2)
print(n1, n2, n1==n2)
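
Because the two forms look identical on screen, it can help to write them with explicit escape codes. The sketch below is an illustration with a shortened word: `'\u00fc'` is 'ü' as a single character, while `'u\u0308'` is a plain 'u' followed by a combining diaeresis. Normalizing to either the composed form (`'NFC'`) or the decomposed form (`'NFD'`) makes the strings comparable.

```python
import unicodedata

composed = 'G\u00fcrzenich'      # 'ü' stored as one code point
decomposed = 'Gu\u0308rzenich'   # 'u' followed by a combining diaeresis

# They render the same, but they are different strings of different lengths
print(composed == decomposed)
print(len(composed), len(decomposed))

# Normalizing both to the same form makes them equal
nfc = unicodedata.normalize('NFC', decomposed)  # combine into single characters
nfd = unicodedata.normalize('NFD', composed)    # split into base + combining characters
print(nfc == composed, nfd == decomposed)
```

Which form to normalize to matters less than applying the same one consistently to all text before comparing or counting strings.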


### NumPy
* NumPy is a widely used Python library for handling multidimensional arrays and performing efficient mathematical operations on them.
* Core object is the ndarray.
* Used under-the-hood of many other popular packages.

In [None]:
import numpy as np

# Create 1 dimensional arrays:

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print(x, y, x+y)

In [None]:
# Create 2 dimensional arrays:

x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([[1, 2, 0], [8, 3, 4]])

print(x)
print(y)
print(x+y)

In [None]:
# Access array elements with familiar index operations:

print(x[0, 1])
print(x[0])
print(x[:, 1])
print(x[:, :2])

In [None]:
# Quickly create arrays:

print(np.zeros(10))
print(np.ones(15))
print(np.arange(0, 10, 0.2))

In [None]:
# Get the shape of, and reshape arrays:

print(x)
print(x.shape)
print()
print(x.reshape(6)) # a 1D array
print(x.reshape(1, 6)) # a 2D array with shape 1, 6
print(x.reshape(6, 1)) # a 2D array with shape 6, 1
print(x.reshape(3, 2)) # a 2D array with shape 3, 2

## 🏋️‍♀️ PRACTICE

In [None]:
# Q5: Read the file "mystery.txt" and print the character string output


In [None]:
# Q6: Read the file 'randomarray.txt' into numpy. 
# Use numpy to take the logarithm of all elements.
# Reshape the array from (x, y) to (y, x) shape
# Write the array to a new file 'log_randomarray_T.txt'


In [None]:
# Q7: Read all the WTA_*.csv files and join them together (without using the csv module).
# Parse the data such that each line is a list, and each element is the correct type (str/int/datetime/etc.)
# Each line: []
# Save the final object as a pickle file


In [None]:
# Q8: Choose a World Bank indicator from https://data.worldbank.org/indicator
# Download the data and read the file(s) as appropriate data structures in Python
