# Session 1: Introduction to Python

# Module 1: The basics of Python


- Module 1 covers Python basics
    - Notes (Current page)
    - Video

- Module 2 covers downloading
    - Notes
    - Video


## I.1 Running Python

- Running Python code
    - Google Colab
        - My recommendation for those just starting
        - Colab can be launched for free from your Google Drive (see the screenshot in I.2)
    - Jupyter Notebook
        - You can download this software as part of Anaconda
        - [Anaconda download page](https://www.anaconda.com/products/individual)




## I.2 Launching a Colab notebook
<img src="Screenshots/colab.png" />


## I.3 Colab basics

- Run code:
    - Press the play button next to the code cell
    - Shift + Return
        - Keyboard shortcut on Macs (Shift + Enter on other keyboards)
        
- Reset notebook
    - Runtime -> Restart runtime
    
- Code versus text cells
    - You cannot run code in a text cell!

# A. The basics: Data types

## A. Note for R-users

- Python is in many ways similar to R

- Comment code: #
- Some functions are exactly the same: print()
- To save an object, write a name and set it equal to the object you want to save
- You must run the code in the correct order
- We use libraries!

In [None]:
# This is a comment
print('This is a print function')

In [None]:
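# Running this cell before the next one raises a NameError,
# because saved_object has not been assigned yet
print(saved_object)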

In [None]:
saved_object = 'SPP'
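
The run-order rule above can be seen in a single cell. This is a minimal sketch that catches the error instead of stopping; the variable name `error_message` is only for illustration.

```python
# Using a name before it has been assigned raises a NameError.
try:
    print(saved_object)
except NameError as err:
    error_message = str(err)
    print('Error:', error_message)

# Once the assignment has run, the same print call works.
saved_object = 'SPP'
print(saved_object)
```

In a notebook the same thing happens across cells: what matters is the order in which cells are run, not the order in which they appear on the page.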

## A.1 Strings

- Think of it as a piece of data wrapped in quotes
- R-users: This is the same as in R!


In [6]:
my_str = 'Python is cool'
print(my_str)

Python is cool


### A.1.1 The type() function tells us what type our object is

In [7]:
type(my_str)

str

### A.1.2 Overwriting

In [None]:
my_str = 'Python is NOT cool'
my_str
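
A short sketch of what overwriting does: the name simply points at the new value, and the new value does not need to have the same type as the old one. The name `first_value` is introduced here only for comparison.

```python
my_str = 'Python is cool'
first_value = my_str           # keep the original value under a second name

my_str = 'Python is NOT cool'  # my_str now points to the new string
print(first_value)             # the second name still holds the old value
print(my_str)

my_str = 42                    # the type can change as well
print(type(my_str))
```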

## A.2 Integers/Floats

In [11]:
my_num = 100
type(my_num)

int

In [12]:
my_num = 100.9483859390942949
type(my_num)

float
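
As a small illustration of how the two number types interact, the sketch below mixes an integer and a float and converts between them. The variable names are made up for this example.

```python
whole = 100        # int
decimal = 100.5    # float

# Mixing an int and a float in arithmetic gives a float.
total = whole + decimal
print(total, type(total))

# Converting between the two types.
as_int = int(decimal)    # drops the decimal part (truncates toward zero)
as_float = float(whole)
print(as_int, as_float)
```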

# B. Lists

- Lists can be made up of strings or numbers
- Spacing and line breaks inside the brackets do not matter!

In [19]:
my_list = ['Python', 'R', 'Stata', 'Excel']

my_list

['Python', 'R', 'Stata', 'Excel']

In [20]:
my_list = [3, 
           5,
           9,
           11]
type(my_list)

list
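
A minimal sketch of the two points above, using made-up list names: a list can hold strings and numbers together, and the layout inside the brackets does not change the list.

```python
# Strings and numbers can sit in the same list.
mixed_list = ['Python', 3, 9.5]

# The same list written on one line and across several lines.
same_a = [3, 5, 9, 11]
same_b = [3,
          5,
          9,
          11]
print(same_a == same_b)

# Lists can grow after they are created.
mixed_list.append('Stata')
print(len(mixed_list))
```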

## B.1 Indexing

### B.1.1 Selecting first element
#### IMPORTANT: Indexing always begins at zero


In [21]:
my_list = ['Python', 'R', 'Stata', 'Excel']

my_list[0]

'Python'

### B.1.2 Select range of elements

#### IMPORTANT: the end index of a range is NOT included


In [23]:
# This selects the elements at index 0 and 1, BUT NOT the element at index 2
my_list[0:2]

['Python', 'R']
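
A few more slicing patterns on the same list, as a sketch. Leaving out the start or end of a slice means "from the beginning" or "to the end", and negative indices count from the end of the list.

```python
my_list = ['Python', 'R', 'Stata', 'Excel']

last_item = my_list[-1]     # last element
middle = my_list[1:3]       # index 1 and 2; index 3 is not included
from_second = my_list[1:]   # index 1 through the end
first_two = my_list[:2]     # same as my_list[0:2]

print(last_item, middle, from_second, first_two)
```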

# 1. Introduction to Pandas

### For R users: Pandas is similar to dplyr!

In general we use it to:

- Manipulate data (Today's topic)
- Visualize our data (graphing, etc.)
- Download data directly from the internet
- Build models (Regression, Machine learning, Neural Networks)

## 1.A We must import the library before using!
- Again, this is similar to R
- However, unlike in R, we (in general) access the library's functions through an alias (pd in the cell below)

In [26]:
import pandas as pd
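
To show what the alias is for, here is a minimal sketch that builds a tiny DataFrame by hand. The data and the name `small_df` are made up for illustration; every pandas function is reached through the `pd.` prefix.

```python
import pandas as pd

# A hypothetical two-row dataset built from a dictionary of columns.
small_df = pd.DataFrame({'state': ['AR', 'SC'],
                         'vote': [1, 0]})

print(type(small_df))
print(small_df.shape)   # (number of rows, number of columns)
```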

## 1.1 Import data
- pd.read_stata
- pd.read_csv
- pd.read_excel

In [30]:
url = 'https://raw.githubusercontent.com/corybaird/SPP_Data_Seminar/main/2021_Fall/R/Session_1/vote.csv'
df = pd.read_csv(url)

In [31]:
type(df)

pandas.core.frame.DataFrame
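
If the URL is not reachable, the same function can be tried on a small piece of text. This is a sketch that assumes a made-up three-row CSV held in a string; `io.StringIO` lets `pd.read_csv` treat the string like a file.

```python
import io
import pandas as pd

# Made-up CSV content: the first line holds the column names.
csv_text = """state,vote,income
AR,1,9
AR,1,11
SC,0,8
"""

demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df)
```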

# 2. Data basics
## 2.1 Show data types, number of observations, etc.


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1502 entries, 0 to 1501
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   state      1502 non-null   object
 1   vote       1502 non-null   int64 
 2   income     1502 non-null   int64 
 3   education  1502 non-null   int64 
 4   age        1502 non-null   int64 
 5   sex        1502 non-null   int64 
dtypes: int64(5), object(1)
memory usage: 70.5+ KB
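
The pieces of information that `info()` prints can also be pulled out one at a time. A minimal sketch on a made-up DataFrame (the name `demo_df` and its values are for illustration only):

```python
import pandas as pd

demo_df = pd.DataFrame({'state': ['AR', 'AR', 'SC'],
                        'vote': [1, 1, 0],
                        'age': [73, 24, 36]})

n_rows, n_cols = demo_df.shape   # number of observations and columns
print(n_rows, n_cols)

print(demo_df.dtypes)            # data type of each column

missing = demo_df.isna().sum()   # count of missing values per column
print(missing)
```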


## 2.2 Head, tail


In [33]:
df.head(2)

|   | state | vote | income | education | age | sex |
|---|-------|------|--------|-----------|-----|-----|
| 0 | AR | 1 | 9 | 2 | 73 | 0 |
| 1 | AR | 1 | 11 | 2 | 24 | 0 |


In [34]:
df.tail(2)

|      | state | vote | income | education | age | sex |
|------|-------|------|--------|-----------|-----|-----|
| 1500 | SC | 1 | 9 | 1 | 36 | 1 |
| 1501 | SC | 1 | 8 | 1 | 120 | 1 |


## 2.3 Descriptive stats

In [35]:
df.describe()

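|       | vote | income | education | age | sex |
|-------|------|--------|-----------|-----|-----|
| count | 1502.0 | 1502.0 | 1502.0 | 1502.0 | 1502.0 |
| mean | 0.855526 | 12.061252 | 2.6498 | 49.278961 | 0.559254 |
| std | 0.351687 | 4.256935 | 1.021356 | 17.591828 | 0.496642 |
| min | 0.0 | 4.0 | 1.0 | 5.0 | 0.0 |
| 25% | 1.0 | 9.0 | 2.0 | 36.0 | 0.0 |
| 50% | 1.0 | 13.0 | 3.0 | 49.0 | 1.0 |
| 75% | 1.0 | 16.0 | 4.0 | 62.0 | 1.0 |
| max | 1.0 | 17.0 | 4.0 | 120.0 | 1.0 |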
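
Single statistics can be computed directly, and individual cells of the `describe()` table can be looked up by row and column label. A sketch on a made-up column of four ages:

```python
import pandas as pd

demo_df = pd.DataFrame({'age': [73, 24, 36, 120],
                        'vote': [1, 1, 0, 1]})

mean_age = demo_df['age'].mean()
max_age = demo_df['age'].max()
print(mean_age, max_age)

# describe() returns a DataFrame, so .loc[row, column] picks out one value.
summary = demo_df.describe()
print(summary.loc['mean', 'age'])
```

Looking at the minimum and maximum is a quick way to spot values that may deserve a second look, such as the maximum age of 120 in the output above.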


## 2.4 Column names

In [36]:
df.columns

Index(['state', 'vote', 'income', 'education', 'age', 'sex'], dtype='object')

In [38]:
columns = df.columns

### 2.4.1 Slicing the column names (works like a list)

In [39]:
# The second and third elements (index 1 and 2)
columns[1:3]  # The end index of a slice is not included

Index(['vote', 'income'], dtype='object')

In [40]:
# Indexing starts at zero
columns[0:2]

Index(['state', 'vote'], dtype='object')

In [41]:
# Negative indices count from the end: this selects the last two elements
columns[-2:]

Index(['age', 'sex'], dtype='object')
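
`df.columns` is a pandas Index rather than a plain Python list, although it slices the same way. If a real list is needed, `tolist()` converts it. A sketch using an empty made-up DataFrame with the same column names:

```python
import pandas as pd

# An empty DataFrame that only carries column names (for illustration).
demo_df = pd.DataFrame(columns=['state', 'vote', 'income',
                                'education', 'age', 'sex'])

print(type(demo_df.columns))            # a pandas Index object

column_list = demo_df.columns.tolist()  # convert to a plain list
last_two = column_list[-2:]
print(last_two)
```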

## 2.5 Select column

In [44]:
df['state'].head(2)

0    AR
1    AR
Name: state, dtype: object

In [45]:
df.state.head(2)

0    AR
1    AR
Name: state, dtype: object
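
Both notations above return a single column. The bracket notation is the more general one: it also works for column names containing spaces, and passing a list of names selects several columns at once. A sketch on made-up data:

```python
import pandas as pd

demo_df = pd.DataFrame({'state': ['AR', 'SC'],
                        'vote': [1, 0],
                        'age': [73, 36]})

one_col = demo_df['state']             # one name -> a Series
two_cols = demo_df[['state', 'age']]   # a list of names -> a DataFrame

print(type(one_col))
print(two_cols)
```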

# 3. Bonus material

## 3.1 Importing directly from a website


<img src="Screenshots/Import.png" />



In [46]:
url = 'https://www.macrohistory.net/app/download/9834512569/JSTdatasetR5.xlsx?t=1641215586'
# Note this is a different function than we used before!
# This function is read_excel instead of read_csv
# sheet_name=1 selects the second sheet (sheets are numbered from zero)
df = pd.read_excel(url, sheet_name=1)
df.head(4)

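|   | year | country | iso | ifs | pop | rgdpmad | rgdppc | rconpc | gdp | iy | ... | eq_capgain | eq_dp | eq_capgain_interp | eq_tr_interp | eq_dp_interp | bond_rate | eq_div_rtn | capital_tr | risky_tr | safe_tr |
|---|------|---------|-----|-----|-----|---------|--------|--------|-----|----|-----|------------|-------|-------------------|--------------|--------------|-----------|------------|------------|----------|---------|
| 0 | 1870 | Australia | AUS | 193 | 1775.0 | 3273.239437 | 13.836157 | 21.449734 | 208.78 | 0.109266 | ... | -0.070045 | 0.071417 |  |  |  | 0.049118 | 0.066415 |  |  |  |
| 1 | 1871 | Australia | AUS | 193 | 1675.0 | 3298.507463 | 13.936864 | 19.930801 | 211.56 | 0.104579 | ... | 0.041654 | 0.065466 |  |  |  | 0.048446 | 0.068193 |  |  |  |
| 2 | 1872 | Australia | AUS | 193 | 1722.0 | 3553.426249 | 15.044247 | 21.085006 | 227.4 | 0.130438 | ... | 0.108945 | 0.062997 |  |  |  | 0.047373 | 0.069861 |  |  |  |
| 3 | 1873 | Australia | AUS | 193 | 1769.0 | 3823.629169 | 16.219443 | 23.25491 | 266.54 | 0.124986 | ... | 0.083086 | 0.064484 |  |  |  | 0.04672 | 0.069842 |  |  |  |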


## 3.2 Scraping HTML

In [49]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_cities'
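# read_html returns a list of DataFrames, one per table found on the page
tables = pd.read_html(url)
# Pick the table at index 1 and show its first five rows
tables[1].head(5)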

|   | City[a] | Country | UN 2018 population estimates[b] | City proper[c]: Definition | City proper[c]: Population | City proper[c]: Area(km2) | City proper[c]: Density(/km2) | Metropolitan area[d]: Population | Metropolitan area[d]: Area(km2) | Metropolitan area[d]: Density(/km2) | Urban area[8]: Population | Urban area[8]: Area(km2) | Urban area[8]: Density(/km2) |
|---|---------|---------|---------------------------------|----------------------------|----------------------------|---------------------------|-------------------------------|----------------------------------|---------------------------------|-------------------------------------|---------------------------|--------------------------|------------------------------|
| 0 |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 1 | Tōkyō | Japan | 37400068.0 | Metropolis prefecture | 13515271.0 | 2191.0 | 6,169[13] | 37274000.0 | 13452.0 | 2,771[14] | 39105000.0 | 8231.0 | 4,751[e] |
| 2 | Delhi | India | 28514000.0 | Capital City | 16753235.0 | 1484.0 | 11,289[15] | 29000000.0 | 3483.0 | 8,326[16] | 31870000.0 | 2233.0 | 14,272[f] |
| 3 | Shanghai | China | 25582000.0 | Municipality | 24870895.0 | 6341.0 | 3,922[17][18] |  |  |  | 22118000.0 | 4069.0 | 5,436[g] |
| 4 | São Paulo | Brazil | 21650000.0 | Municipality | 12252023.0 | 1521.0 | 8,055[19] | 21734682.0 | 7947.0 | 2,735[20] | 22495000.0 | 3237.0 | 6,949[h] |
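
The scraped table has two header rows, which pandas stores as two-level column names. One possible way to work with them is to flatten each pair into a single name. The sketch below does this on a small hand-built DataFrame (a few values copied from the table, with simplified header names), not on the live page; the joining rule with an underscore is an assumption made for illustration.

```python
import pandas as pd

# Two-level column names, similar in shape to the scraped table.
two_level = pd.MultiIndex.from_tuples([
    ('City', 'City'),
    ('City proper', 'Population'),
    ('City proper', 'Area'),
])
cities = pd.DataFrame([['Tokyo', 13515271, 2191],
                       ['Delhi', 16753235, 1484]],
                      columns=two_level)

# Join the two levels with an underscore, unless both levels are identical.
cities.columns = [top if top == bottom else f'{top}_{bottom}'
                  for top, bottom in cities.columns]

print(cities.columns.tolist())
print(cities['City proper_Population'])
```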
