### Why Python?
   * Gateway to learn many skills
   * Shallow learning curve 
       * Simple syntax
       * Extensive documentation
       * Lots of internet resources
   * Free and open source
   * Interacts with various platforms
       * Database API - connecting to MySQL, Oracle, etc.
       * Boto3 library - interacting with Amazon Web Services
   * Variety of Useful Libraries for Data Analysis and Data Science
       * NumPy and Pandas - core libraries for data analysis
       * StanfordNLP - natural language analysis
       * BeautifulSoup - web scraping
       * Requests - HTTP requests
       
### What can you do with Python?
   * Automation of Repetitive Tasks
       * Download LODES data from Census 
   * Natural Language Processing 
       * Classify land use reform articles by topic
   * Web Scraping
       * Scrape DC Court Website for regulation data

### Running Python
   * Python Version 3 is recommended
       * Improved standard library modules, security fixes, bug fixes
   * Ways to install on your machine:
       1. Install Python from [source](https://www.python.org/downloads/)
       2. Install Anaconda from [Anaconda site](https://www.anaconda.com/distribution/) (Recommended) 
           * Anaconda is a Data Science platform, and comes with
               * Python language and necessary libraries pre-installed
               * Package manager Conda 
               * Graphical Interface Anaconda Navigator 
               * IDEs including Jupyter and Spyder

## Intro To Python Demo

Jupyter Notebook allows you to create documents that contain code, markdown text, and visualizations.


### How to run Jupyter Notebook on a temporary server without installation

* Visit the link https://jupyter.org/try
* Select `Try Classic Notebook`. This might take a few minutes to load. 
* In the File tab, select `New Notebook` -> `Python 3`

### Assigning values to variables
`=` is used to assign values to variables

The variable name on the left-hand side of the operator is assigned the value on the right-hand side. 

(To print the value of an object, use the `print()` function)

In [1]:
a = 2
print(a)

2


### Comparators
Comparator operators are used to compare objects. 

`==` is a comparator, and is different from the assignment operator `=`.

`==` will return a True or False boolean value, depending on whether the two values are equivalent. 

In [2]:
b = 2

a == b

True

Other comparators include `!=`, `>`, `<`, `>=`, and `<=`. Comparators will always return a boolean `True` or `False`.
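
As a quick illustration of the other comparators, the sketch below compares two example values. The variable names `p` and `q` and their values are chosen here for illustration only.

```python
p = 3
q = 7

not_equal = p != q          # True, because 3 and 7 are different
less_than = p < q           # True, because 3 is smaller than 7
greater_or_equal = p >= q   # False, because 3 is neither greater than nor equal to 7

print(not_equal, less_than, greater_or_equal)
```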

### Data Types

Variables can store different types of data.

To determine the type of an object, use the `type()` function.

We will discuss 4 important variable types.

#### bool - True or False boolean value

In [3]:
x = True
type(x)

bool

There are three main logical operators that can be applied to bool values:

* `x and y`: True if both x and y are True, else False
* `x or y`: True if either x or y is True, else False
* `not x`: Negates the value of x

In [4]:
x = True
y = False

In [5]:
x and y

False

In [6]:
x or y

True

In [7]:
not x

False
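
Logical operators are often combined with comparators. The following is a small sketch of that idea; the variable `age` and the thresholds are hypothetical example values.

```python
age = 30  # hypothetical example value

is_adult = age >= 18    # comparator returns a bool
is_senior = age >= 65

adult_not_senior = is_adult and not is_senior
adult_or_senior = is_adult or is_senior

print(adult_not_senior, adult_or_senior)
```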

#### int - Integer numbers

In [8]:
a = 5
type(a)

int

#### float - Decimal numbers

In [9]:
b = 1.0
type(b)

float

#### str - Text data

In [10]:
c = "hello world"
type(c)

str

#### String Manipulation

Strings can be concatenated with the `+` operator.

In [11]:
str_1 = "I love"
str_2 = "Python"
str_1 + " " + str_2

'I love Python'

Values can be substituted into a string using the `format()` method.

In [12]:
print("I love {} and {}.".format("The Urban Institute", "Python"))

I love The Urban Institute and Python.
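
The placeholders passed to `format()` can also be referenced by position or by name. The sketch below shows both, along with an f-string, which is an alternative syntax (available in Python 3.6 and later) that is not used elsewhere in this demo. All variable names and values are illustrative.

```python
name = "Python"
count = 3

# {0} refers to the first argument and can be reused
by_position = "I love {0}. {0} is fun.".format(name)

# Named placeholders are filled from keyword arguments
by_name = "{lang} has {n} examples here.".format(lang=name, n=count)

# An f-string evaluates the expression inside the braces directly
with_fstring = f"I love {name}."

print(by_position)
print(by_name)
print(with_fstring)
```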


### Data Structures

Data structures are ways of organizing data.

#### List 

Lists are one of the most common ways to hold a sequence of values in Python. Lists are created as a comma separated sequence of values between square brackets.

In [13]:
names = ["Michelle", "Vivian", "Kyle", "Aplhonse", "Alena"]
names[0]

'Michelle'

A negative index counts from the end of the list, so `-1` refers to the last item.

In [14]:
names[-1]

'Alena'

We can use `:` to slice a list with a `start:end` pair of indexes. The item at the end index is not included in the returned list.

In [15]:
names[1:4]

['Vivian', 'Kyle', 'Aplhonse']
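
Either side of the `:` can be left out, and an optional third value sets the step. The sketch below uses a hypothetical list called `letters` to show a few common variations.

```python
letters = ["a", "b", "c", "d", "e"]

first_two = letters[:2]     # start omitted: begins at index 0
from_third = letters[2:]    # end omitted: runs to the end of the list
last_two = letters[-2:]     # negative start: counts from the end
every_other = letters[::2]  # step of 2: takes every second item

print(first_two, from_third, last_two, every_other)
```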

#### Dictionary

Dictionaries store key-value pairs. They can be created with curly braces, with each key and value separated by a colon. 

The key is placed on the left-hand side of the colon while the value is placed on the right-hand side. 

In [None]:
birth_year = {"Jenny" : 1990, "Cindy" : 1977, "Bob" : 2001}

In [None]:
birth_year["Cindy"]

1977
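
Beyond looking up a value by its key, a few other dictionary operations come up often. The sketch below uses a hypothetical `inventory` dictionary to show adding an entry, testing for a key, looking up a key that may be missing, and looping over the pairs.

```python
inventory = {"apples": 3, "pears": 5}  # hypothetical example data

inventory["plums"] = 7                  # assigning to a new key adds an entry
has_pears = "pears" in inventory        # `in` tests whether a key exists
missing = inventory.get("kiwis")        # .get() returns None instead of raising KeyError

# .items() yields (key, value) pairs
for fruit, quantity in inventory.items():
    print(fruit, quantity)
```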

### Control flow

The if, elif (i.e., else-if) and else statements are used to control the execution of code based upon a condition. The if statement checks the condition given, and executes the code if it evaluates to True.

Note: In Python, blocks of code are indicated by indentation (four spaces per level by convention). To ensure that your code is evaluated within the `if` statement, make sure that the code is indented one level more than the `if`.

In [None]:
y = 11
if y <= 10:
    print("y is less than or equal to 10")
else:
    print("y is greater than 10")

y is greater than 10


If you want to evaluate multiple conditions, your `if` statement can be followed by one or more `elif` statements. Once a condition evaluates to True, its block runs and the rest of the if/elif/else chain is skipped.

In [None]:
z = 20
if z < 10:
    print("z is less than 10")
elif z < 15:
    print("z is less than 15")
elif z < 25:
    print("z is less than 25")
else:
    print("z is greater than or equal to 25")

z is less than 25


### Loops

In Python, loops are used to repeatedly run lines of code.

#### For Loops

The `for` statement allows you to iterate over, and do something with, each item in a sequence. For a simple example, iterate over a list of numbers, and print the squares.

In [None]:
a = [1, 2, 3, 4, 5]

for number in a:
    print(number**2)

1
4
9
16
25


#### While Loops
A while loop keeps iterating as long as its condition is True, and stops once the condition becomes False. Be careful, as it is easy to write a while loop whose condition never becomes False, leaving your code running indefinitely in an infinite loop.

In [None]:
b = 0
while b < 5:
    print(b)
    b += 1

0
1
2
3
4
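
One way to guard against an accidental infinite loop is to cap the number of iterations and exit with `break`. This is only a sketch of that idea; the names `total`, `count`, and `max_iterations` and the cap of 1000 are assumptions made for the example.

```python
total = 0
count = 0
max_iterations = 1000  # hypothetical safety cap

while total < 20:
    total += 6
    count += 1
    if count >= max_iterations:
        # Stop even if the loop condition is still True
        break

print(total, count)
```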


### List Comprehension

A list comprehension creates a new list by applying an expression to each item of an existing list (or any other sequence).

In [None]:
a = [1, 2, 3, 4, 5]

a_squared = [number**2 for number in a]

a_squared

[1, 4, 9, 16, 25]
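
A list comprehension can also filter items by adding an `if` clause at the end. The sketch below, using illustrative variable names, keeps only the even numbers before squaring them and then applies a function to each item of a list of strings.

```python
numbers = [1, 2, 3, 4, 5, 6]

# Only items for which the condition is True are included
even_squares = [n**2 for n in numbers if n % 2 == 0]

# Any expression can be applied to each item, including a function call
lengths = [len(word) for word in ["data", "science"]]

print(even_squares)
print(lengths)
```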

# Pandas

#### Pandas (Python Data Analysis Library) is one of the most widely used Python libraries for data manipulation and analysis. 

- Optimized performance: performance-critical code is written in Cython or C
- Compatible with many data file formats (CSV, Excel, Stata files)
- Creates a Python object called a data frame, with rows and columns, that looks very similar to a table in statistical software
- A data frame is much easier to work with than lists or dictionaries processed through for loops or list comprehensions
- Open source 

#### Three ways to create a new pandas dataframe 

- Convert a Python list, dictionary, or NumPy array to a Pandas data frame
- Open a local file using Pandas (e.g., CSV, Excel, Stata, JSON) 
- Open a remote file or database through a URL.
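
The first two approaches can be sketched as follows. The column names and values are made up for illustration, and `io.StringIO` stands in for a local CSV file so that the example does not depend on any file on disk.

```python
import io

import pandas as pd

# 1. Convert a dictionary of lists: keys become column names
df_from_dict = pd.DataFrame(
    {"name": ["Jenny", "Cindy", "Bob"], "birth_year": [1990, 1977, 2001]}
)

# 2. Read CSV content; a file path could be passed to read_csv instead
csv_text = "name,birth_year\nJenny,1990\nCindy,1977\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_dict)
print(df_from_csv)
```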

## Practice

#### Load data 

In [None]:
#import pandas 
import pandas as pd

#Load data from a CSV file 
dat = pd.read_csv("https://ui-spark-data-public.s3.amazonaws.com/lodes/summarized-files/Tract_level/fed_jobs/rac_all_fed_tract_level.csv")

In [None]:
dat.iloc[0:10]

### Inspect Dataframe

In [None]:
#Basic info of the dataframe 
dat.shape

print('Number of rows:', dat.shape[0])
print('Number of columns:', dat.shape[1])

#len(dat)     #number of rows 
#dat.count()  #the number of non-NA values for each column 

In [None]:
#Column names
dat.columns

In [None]:
# Detailed info for each column
dat.info()

### Subset Dataframe: the "locate" indexers (`loc` and `iloc`) 

DataFrame.loc
- Label-based: Access a group of rows and columns by label(s) or a boolean array.

DataFrame.iloc
- Integer-based: Access a group of rows and columns by integer position(s).

![image.png](attachment:image.png)


##### The order of the indexes/labels inside the brackets matters! 

dat.loc[0, 'year']

- Separated by a comma
- The first index/label (before the comma): the row(s) that we want to retrieve.
- The second index/label (after the comma): the column(s) we want to retrieve. It is optional; without a second index, iloc/loc will retrieve all columns by default.
- To retrieve all rows but only selected columns, use a colon (:) before the comma to represent all rows, e.g. dat.loc[:, ['year', 'c000']]


Most common inputs in the bracket []:
- A single label/index, e.g., 'year' 
- A list of labels/indexes, e.g. ['year', 'stname', 'ctyname'] or [0, 5, 9]
- A slice object with labels/indexes, e.g. 'year':'c000', 0:9

In [None]:
#A single integer position for a row
dat.iloc[0]   #returns a Series 

In [None]:
#A list of integer positions
dat.iloc[[0, 3, 7]]

In [None]:
#A slice of integer positions (the end position is not included)
dat.iloc[0:5]

In [None]:
#A slice of row labels and a list of column labels
dat.loc[0:6, ['year', 'stname', 'trctname', 'c000']]   

# Note that 0:6 is interpreted as labels of the index, not as integer positions along the index, so the end label 6 is included.

In [None]:
# .loc[] can access a group of rows and columns by a boolean array - conditionally subset the dataframe 
va_dat = dat.loc[dat['stname'] == 'Virginia', ['year', 'stname', 'ctyname', 'trctname', 'c000', 'ca01']]
va_dat.iloc[700:703]
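
The difference between label-based and position-based selection is easier to see when the row labels are not 0, 1, 2, and so on. The sketch below uses a small made-up data frame called `toy` with row labels 10 to 40; none of these names or values come from the LODES data.

```python
import pandas as pd

toy = pd.DataFrame(
    {"year": [2010, 2011, 2012, 2013], "jobs": [5, 8, 13, 21]},
    index=[10, 20, 30, 40],  # row labels differ from integer positions
)

# .loc slices by label and includes the end label (rows 10, 20, 30)
by_label = toy.loc[10:30, "jobs"]

# .iloc slices by position and excludes the end position (rows at 0 and 1)
by_position = toy.iloc[0:2]

# A boolean condition selects rows; the list selects columns
filtered = toy.loc[toy["jobs"] > 6, ["year"]]

print(by_label)
print(by_position)
print(filtered)
```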

### Rename, groupby, operations, sort 

In [None]:
dat_renamed = dat.rename(columns = {'c000': 'total_jobs'})

In [None]:
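#Group by two columns and return the mean of the remaining numeric columns.
#numeric_only=True skips text columns such as stname, which recent pandas versions refuse to average.
avg_jobs_by_cty = va_dat.groupby(['ctyname', 'year']).mean(numeric_only=True)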
avg_jobs_by_cty.iloc[0:24]

In [None]:
# Group by two columns and return the mean of only one particular column in each group.
avg_total_jobs_by_cty = va_dat.groupby(['ctyname', 'year'])['c000'].mean().reset_index()
avg_total_jobs_by_cty

In [None]:
#sort values 
sorted_cty_jobs = avg_total_jobs_by_cty.sort_values(by = ['c000'], ascending = False)
sorted_cty_jobs.iloc[0:15]
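
The group, aggregate, and sort steps above can be checked by hand on a tiny data frame. The sketch below uses made-up county names and job counts, not the LODES data, so the expected means are easy to verify.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "county": ["A", "A", "B", "B"],
        "year": [2010, 2011, 2010, 2011],
        "jobs": [10, 20, 30, 50],
    }
)

# Mean of jobs within each county; reset_index turns the group key back into a column
avg_by_county = toy.groupby("county")["jobs"].mean().reset_index()

# Largest average first
ranked = avg_by_county.sort_values(by="jobs", ascending=False)

print(ranked)
```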

### Write out 

In [None]:
sorted_cty_jobs.to_csv('subset_df.csv', index = False)