# Python and Social Data Science

# This talk

- Why social data science 
- Course overview
- Learning to code
- Python 
   - Why we chose it
   - Advanced concepts
- Power tools: git and markdown

## Data science - an overview

Some trends

- **Data** is increasingly available
- Improved **algorithms** and methods for computation
- Faster and bigger **computers**

Some big successes 
- Image and text recognition, e.g. recognizing faces or parsing language with Google Translate/GPT-3
- Artificial intelligence, e.g. self-driving cars, playing computer games and poker, trading bots
- Combined services, e.g. virtual assistants, recommendation systems 

## Past the peak?
For a couple of years, data scientists had the HIGHEST entry wages

Problem: the prediction-based agenda is flawed. But this opens up new opportunities:

- Combine with theory 
- Combine with causal inference, role for econometrics and structural modelling

## Social Data Science

The skills and ideas of data science are spreading beyond

- incorporate machine learning into
    - statistics and causal inference
    - economic modelling
    
- smart, free tools for working with
    - small and big data on structured (tabular) data
    - unstructured data sources from image, text and social media

Takeaways: tools useful for social scientists
- enhance existing fields
- new field emerging (new data, combination of methods)



# Course overview

## Module structure

Most teaching modules will have the following structure

- 1. Before exercise class: watch recorded lectures and do the reading. If you have time, you may attempt to solve exercises
- 2. Exercise class: continue working on exercises and discuss with TA
- 3. Live lecture. The format will vary for each module, but the main component is that YOU ask questions.

## The wheel of data science
<br>
<br>
<center><img src='https://raw.githubusercontent.com/hadley/r4ds/master/diagrams/data-science.png' alt="Drawing" style="width: 700px;"/></center>


## Learning outcomes after completing Intro SDS 

- Tidy / transform: data structuring and text (sessions 1-5,15)
- Import: scraping and data IO (sessions 2, 6-8)
- Visualize: plotting (session 3)
- Model: fundamentals of machine learning, application to text (sessions 11-15)
- Communicate: git and markdown (covered briefly at end of talk)

## What we don't teach you now

Many more courses build on this
- Statistics: Econometrics and Machine Learning - overlaps
- More about data - modelling and processing:
    - Text, networks/relational, spatial
- Non-linear ML models:
    - Tree based and kernel 
    - Neural networks
- ML for dynamic decisions: reinforcement learning
- All about privacy

# Learning how to code

## Not a free lunch

This course.. ain't easy..

Learning without supervision: 
- Data structuring as experimentation
- Struggle with simple stuff
<center><img src='https://media.giphy.com/media/h36vh423PiV9K/giphy.gif' alt="Drawing" style="width: 400px;"/></center>


## Some encouragement

#### Hadley Wickham

> The bad news is that when ever you learn a new skill you’re going to suck. It’s going to be frustrating. The good news is that is typical and happens to everyone and it is only temporary. You can’t go from knowing nothing to becoming an expert without going through a period of great frustration and great suckiness.


#### Kosuke Imai

> One can learn data analysis only by doing, not by reading.

## Light at the end..

Why would you go through this pain? You choose one of two paths.

i. You move on, you forget some or most of the material.

ii. You are lit and your life has changed. 
- You may return to become a better sociologist, anthropologist, economist etc.
- Or, you may continue along the new track of data science.
- In any case, you keep learning and expanding your programming skills.

## Advice for coding

- Be careful: think before you code - what are you trying to make it do?

- Be lazy: reuse code and write reusable code (functions)

- Make understandable: think about audience
    1. Future you? May not recall this at all.    
    1. Group members or world? May need some background explanation/documentation.
    


## Advice for learning to code

- Maintain healthy curiosity - how could we do things better?
- Practice and try as much as possible
- Type the code in yourself - then you see what is going on.


# Python vs. R (vs. Julia)

## Tradeoff?

- Each language has its own advantages, and they share similarities (my opinion)

|                     | Python | R | Julia |
|---------------------|--------|---|---|
| Data structuring    | X      | X | |
| Plotting            | X      | X | |
| Machine learning    | X      |   | |
| Statistics          |        | X | |
| General programming and modelling | X      |   | (X)|
| Ease of learning    | X      |   | |


- Tools are increasingly integrated
    - Jupyter a shared framework for data science 
    - New software allows direct execution side-by-side: use R within Python (and vice versa)
    - New tools becoming available across languages, e.g. data processing engine (arrow)
    


Advice: don't worry. It's likely you need to learn more than one language.

# Python - advanced concepts


## Making code reusable


### Functions
Be careful of input/output and objects created

- Globals: objects defined outside function
    - Note: these can be read both inside and outside functions (although assigning to the same name inside a function creates a local that hides the global there)
- Locals: objects that are created within a sub-level. 
    - Example: objects defined inside a function
    - Cannot be used outside the function, unless we pass them out with `return`
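
The distinction between globals and locals can be sketched as follows. This is a minimal illustration; the names `scale`, `rescale`, `shadowed` and `rescaled_inside` are made up for the example.

```python
scale = 10  # global: defined outside any function


def rescale(values):
    # `rescaled_inside` is a local: it only exists while the function runs
    rescaled_inside = [v * scale for v in values]  # the global `scale` is readable here
    return rescaled_inside  # `return` hands the value back to the caller


def shadowed():
    scale = 2  # creates a new local `scale`; the global is unaffected
    return scale


scaled = rescale([1, 2, 3])  # [10, 20, 30]
local_value = shadowed()     # 2, while the global `scale` is still 10

try:
    rescaled_inside  # the local name from `rescale` is not visible out here
    local_leaked = True
except NameError:
    local_leaked = False
```

Only the returned value (`scaled`) survives the function call; the local name itself raises a `NameError` outside the function.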

### Check-list for functions
Some key questions
- Should I write a function? (yes if you are repeating some process)
- If yes:
    - What do you intend the function to output? Is the correct output returned?
    - Do you use locals where possible? 
    - Are globals assigned before you define the function in notebook/script?
        - (if not your function may fail! reason: some used globals are not defined)
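
A function that follows the check-list might look like the sketch below. The function name `share_above` and the numbers are hypothetical, chosen only to illustrate the points: everything the function needs comes in as arguments, intermediate results are locals, and the intended output is returned.

```python
def share_above(values, threshold):
    """Return the share of entries in `values` that exceed `threshold`."""
    if not values:
        return 0.0  # avoid dividing by zero for empty input
    n_above = sum(1 for v in values if v > threshold)  # local, not a global
    return n_above / len(values)


# Made-up example data
incomes = [120, 310, 250, 90]
share = share_above(incomes, threshold=200)  # 0.5
```

Because the function does not rely on any globals, it works no matter where in the notebook or script it is defined.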


### Classes 
Where do objects come from?

- From a `class`: a collection of pre-defined attributes and methods.
- We can make our own. It is complex, but powerful.
- Examples of classes/objects we will see: 
    - dataframes, scraping tools, machine learning model etc.
    - (everything is an object)
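
As a rough sketch of what making our own class looks like, the toy example below defines a class with two attributes and two methods. The class name `Survey` and its contents are invented for illustration.

```python
class Survey:
    """A toy container for numeric survey responses (hypothetical example)."""

    def __init__(self, name):
        # attributes: data stored on each object
        self.name = name
        self.responses = []

    def add(self, answer):
        # method: a function attached to the object
        self.responses.append(answer)

    def mean(self):
        return sum(self.responses) / len(self.responses)


wellbeing = Survey("wellbeing")  # create an object (instance) from the class
wellbeing.add(3)
wellbeing.add(5)
avg = wellbeing.mean()  # 4.0
```

Dataframes and machine learning models are used in the same way: we create an object from a class and then call its methods.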

## Copy vs. view

Important: when writing `A = B`, no data is copied. `A` and `B` are two names for the same object!

- In other words: `A` acts as a **view** of `B`
- Implication 
    - if the object is mutable, e.g. a list or dataframe: in-place changes made through `B` show up in `A` and vice versa 

We can break this dependency by explicitly making a **copy**. 
- For instance, in pandas use the `.copy()` method: `A = B.copy()`
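
The same behaviour can be seen with plain Python lists, sketched below without pandas. Note the extra caveat for nested objects: a shallow copy still shares the inner lists, so `copy.deepcopy` from the standard library is used to get a fully independent copy.

```python
import copy

b = [1, 2, 3]
a = b            # no copy: `a` and `b` are two names for one list
a.append(4)      # the change is visible through `b` as well
same_object = a is b  # True

c = b.copy()     # an explicit copy is a new, independent list
c.append(5)      # `b` is not affected

# Nested objects: a shallow copy shares the inner lists
nested = [[1, 2], [3]]
shallow = nested.copy()
deep = copy.deepcopy(nested)
nested[0].append(99)  # shows up in `shallow`, but not in `deep`
```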



## Coding that is fast

### One-liners using comprehensions

A very compact way of writing code: *comprehension*. Example of list comprehension:
```python
new_list = [my_proc(a) for a in my_list] 
```

The new thing is that we define loop inside the list! 
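
To see what the comprehension replaces, the sketch below writes the same computation as an ordinary loop first. The body of `my_proc` and the contents of `my_list` are placeholders made up for the example.

```python
def my_proc(a):
    return a ** 2  # placeholder: any function of one element works


my_list = [1, 2, 3, 4]

# Ordinary loop: create an empty list and fill it step by step
loop_list = []
for a in my_list:
    loop_list.append(my_proc(a))

# List comprehension: same result in one line
new_list = [my_proc(a) for a in my_list]

# Comprehensions can also filter, and can build dictionaries
even_only = [my_proc(a) for a in my_list if a % 2 == 0]
by_input = {a: my_proc(a) for a in my_list}
```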


### The need for speed
Python is elegant and simple

However, Python is NOT built for speed. 

We can compensate by using smart packages that have fast algorithms, e.g. numpy and pandas


### Vectorization

An alternative is to write our code in terms of numpy arrays.

In the example below we get a speed-up of roughly 50 times!

```python
%%timeit
[i+3 for i in range(10**5)]
```

```
9.24 ms ± 420 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

```python
import numpy as np
```

```python
%%timeit
np.arange(10**5)+3
```

```
180 µs ± 36.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
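
Since `%%timeit` only works inside IPython/Jupyter, a similar comparison in a plain script could be sketched with the standard-library `timeit` module, as below. The function names are made up, the sketch assumes numpy is installed, and the measured times will differ from machine to machine.

```python
import timeit

import numpy as np

n = 10**5


def with_loop():
    return [i + 3 for i in range(n)]


def with_numpy():
    return np.arange(n) + 3


# Both approaches give the same numbers
same_result = np.array_equal(np.array(with_loop()), with_numpy())

# Total time in seconds for 20 runs of each version
loop_time = timeit.timeit(with_loop, number=20)
numpy_time = timeit.timeit(with_numpy, number=20)
print(f"loop: {loop_time:.4f}s, numpy: {numpy_time:.4f}s")
```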


### Take away on speed

Use pandas or numpy - they are optimized for speed! 

Why are pandas and numpy fast? 

- Because they are written in fast, low-level code with optimized algorithms! 

Why are we not learning a low-level language? 
- Requires much more code for simple operations - not efficient!
- Too steep a learning curve (not like Python!!)

# Power tools: git and markdown

## Git, a non-technical overview

Git is a command-line tool that offers:

1) "Track changes" system for files
- A log of all changes is kept - from nothing to current version
- All changes are explicitly declared by you, may annotate
  - You can try out things, but only save meaningful changes!

2) Share the files you want, how you want 
- A git folder, called **repository**, can be copied by others
- Many sites allow public and private repositories - you decide access
    


Not covered explicitly in course. Learn to use point-and-click version for fetching files.

## Markdown

[LaTeX](https://en.wikipedia.org/wiki/LaTeX) can create beautiful scientific documents.

Problem: the underlying LaTeX code is heavy to read. Is there an alternative?

Yes: markdown. Like Python, it keeps code simple. Example:
        
Making italic text:
- markdown: `*Some text*`
- LaTeX: `\textit{Some text}`

How do you learn it? Open our notebook cells or see the tutorial in the reading list

# Outro

- Coding is tough, but worth learning
    - Also for social scientists
- You can dig deeper into
   - Advanced python
   - Git for saving and sharing code
   
- Next lecture this afternoon: Strings, queries and APIs