# Week 4: An intro to Python, Pandas and Jupyter 

## Environment basics
- Class [GitHub repo](https://github.com/stiles/usc)
- Structure and workflow
- Clone the repo using GitHub Desktop | [documentation](https://docs.github.com/en/desktop/contributing-and-collaborating-using-github-desktop/adding-and-cloning-repositories/cloning-a-repository-from-github-to-github-desktop)
- Set a directory structure locally and stick to it!

### Jupyter Lab interface basics
- Directory pane
- [Documentation](https://jupyterlab.readthedocs.io/en/stable/)
- Typing and executing code in cells
- Cell types
- Running/restarting a notebook
- Markdown/documentation
- Keyboard shortcuts
- Copy/paste cells

### Import Python tools
These are the Python libraries we will use to complete our work. Here we only need to import [Pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started). 

In [1]:
import pandas as pd

### Now what? 

#### Let's start with simple math

In [2]:
10 + 6

16

In [3]:
10 + 5 

15

#### Defining a variable

In [4]:
number = 100

In [5]:
print(number)

100


In [6]:
number

100

In [7]:
number + 3

103

#### Conduct a simple data analysis

In [8]:
my_list = [2, 4, 6, 8, 10, 12, 14, 16]

In [9]:
my_list

[2, 4, 6, 8, 10, 12, 14, 16]

In [10]:
my_series = pd.Series(my_list)

In [11]:
# my_series = pd.Series(np.random.randint(10, 1000, size=100000000))  # requires: import numpy as np

In [12]:
my_series

0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
dtype: int64
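
The left-hand column in the output above is the Series index, and the right-hand column holds the values. As a minimal sketch (the variable names `first_value`, `last_value` and `doubled` are illustrative, not part of the class notebook), here is how you might pull out single values by position and apply arithmetic to every value at once:

```python
import pandas as pd

my_series = pd.Series([2, 4, 6, 8, 10, 12, 14, 16])

# .iloc selects by position, much like indexing a Python list
first_value = my_series.iloc[0]
last_value = my_series.iloc[-1]

# Arithmetic on a Series applies to every value at once
doubled = my_series * 2

print(first_value, last_value)  # 2 16
print(doubled.tolist())         # [4, 8, 12, 16, 20, 24, 28, 32]
```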

#### Descriptive statistics

Once the data becomes a Series, you can immediately run a wide range of [descriptive statistics](https://en.wikipedia.org/wiki/Descriptive_statistics). Let’s try a few.

In [13]:
my_series.sum()

72

#### Then find the maximum value in the next cell

In [14]:
my_series.max()

16

#### And the minimum value in the cell after that

In [15]:
my_series.min()

2

#### How about the average (also known as the mean)? Keep adding cells and calculating new statistics.

In [16]:
my_series.mean()

9.0

#### The median?

In [17]:
my_series.median()

9.0

#### The standard deviation?

In [18]:
my_series.std()

4.898979485566356
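
One detail worth knowing: by default, `std()` in Pandas returns the *sample* standard deviation, which divides by n - 1 rather than n. The sketch below compares it with the population version using the `ddof` parameter; the variable names are illustrative.

```python
import pandas as pd

my_series = pd.Series([2, 4, 6, 8, 10, 12, 14, 16])

# Default (ddof=1): sample standard deviation, divides by n - 1
sample_std = my_series.std()

# ddof=0: population standard deviation, divides by n
population_std = my_series.std(ddof=0)

print(round(sample_std, 4))      # 4.899
print(round(population_std, 4))  # 4.5826
```

If your data covers the whole population you care about, rather than a sample of it, the `ddof=0` version may be the more appropriate choice.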

#### And all of the above, plus a little more about the distribution, in one simple command

In [19]:
my_series.describe()

count     8.000000
mean      9.000000
std       4.898979
min       2.000000
25%       5.500000
50%       9.000000
75%      12.500000
max      16.000000
dtype: float64
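
The `25%`, `50%` and `75%` rows are quartiles. If you only need those, the `quantile()` method can compute them directly. This is a small sketch using the same eight numbers; note that Pandas interpolates between neighboring values by default, which is why 5.5 and 12.5 appear even though they are not in the list.

```python
import pandas as pd

my_series = pd.Series([2, 4, 6, 8, 10, 12, 14, 16])

# Ask for the 25th, 50th and 75th percentiles in one call
quartiles = my_series.quantile([0.25, 0.5, 0.75])

print(quartiles)
```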

---

## Import data

#### Read a CSV file with members of Congress

In [20]:
df_csv = pd.read_csv('../../data/raw/members_of_congress_117.csv')

#### Or an Excel file

In [21]:
df_excel = pd.read_excel('../../data/raw/members_of_congress_117_excel.xlsx')

#### Import from a URL

In [22]:
df_url = pd.read_csv('https://raw.githubusercontent.com/stiles/notebooks/master/congress/output/members_of_congress_117.csv')

#### Make a copy and assign a new variable

In [23]:
df = df_csv.copy()
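
Why bother with a copy? Working on a copy lets you change things freely while keeping the data you imported untouched. The sketch below uses a tiny made-up dataframe (the names `original` and `working` are illustrative) to show that edits to the copy do not flow back to the original.

```python
import pandas as pd

# A tiny stand-in for the imported data
original = pd.DataFrame(
    {"last_name": ["Adams", "Aderholt"], "party": ["D", "R"]}
)

# copy() creates an independent dataframe
working = original.copy()
working["party"] = working["party"].replace({"D": "Democrat", "R": "Republican"})

print(original["party"].tolist())  # ['D', 'R']
print(working["party"].tolist())   # ['Democrat', 'Republican']
```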

---

## Understanding the dataframe

#### Use the `describe()` method to get summary stats on any numerical columns

In [24]:
df.describe()

| | seniority | session | dw_nominate | votes_with_party_pct |
| --- | --- | --- | --- | --- |
| count | 548.0 | 548.0 | 385.0 | 542.0 |
| mean | 10.875912 | 117.0 | 0.02373 | 95.338506 |
| std | 8.923969 | 0.0 | 0.456891 | 5.383729 |
| min | 1.0 | 117.0 | -0.759 | 55.09 |
| 25% | 4.0 | 117.0 | -0.393 | 94.12 |
| 50% | 8.0 | 117.0 | -0.197 | 97.51 |
| 75% | 15.0 | 117.0 | 0.464 | 98.605 |
| max | 50.0 | 117.0 | 0.897 | 100.0 |


#### Use the `info()` method for data types and columns

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 548 entries, 0 to 547
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    548 non-null    object 
 1   seniority             548 non-null    int64  
 2   full_name             548 non-null    object 
 3   apstate               536 non-null    object 
 4   first_name            548 non-null    object 
 5   last_name             548 non-null    object 
 6   chamber               548 non-null    object 
 7   session               548 non-null    int64  
 8   state                 548 non-null    object 
 9   party                 548 non-null    object 
 10  dw_nominate           385 non-null    float64
 11  votes_with_party_pct  542 non-null    float64
dtypes: float64(2), int64(2), object(8)
memory usage: 51.5+ KB
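
The `Non-Null Count` column above hints at missing data: `dw_nominate` has only 385 values for 548 rows. One way to count the gaps directly is `isna().sum()`. This sketch uses a three-row sample built from values shown elsewhere in this notebook; the `sample` dataframe itself is illustrative.

```python
import pandas as pd

# Three rows with the same kind of gaps as the full dataset
sample = pd.DataFrame(
    {
        "last_name": ["Adams", "Allred", "Armstrong"],
        "dw_nominate": [-0.465, None, None],
        "votes_with_party_pct": [98.65, 97.54, 95.19],
    }
)

# isna() marks each missing cell as True; sum() counts them per column
missing_counts = sample.isna().sum()

print(missing_counts)
```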


In [26]:
# As written this is a no-op: it reassigns the existing column labels unchanged
df.columns = df.columns

#### Reading one column, or "series"

In [28]:
df['last_name']

0           Adams
1        Aderholt
2         Aguilar
3           Allen
4          Allred
          ...    
543        Warren
544    Whitehouse
545        Wicker
546         Wyden
547         Young
Name: last_name, Length: 548, dtype: object
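
Single brackets with one column name return a Series. As a hedged sketch on a small sample (the `sample` dataframe is illustrative, with rows taken from this dataset), here is how you might select several columns at once, or keep only the rows that meet a condition:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "last_name": ["Adams", "Aderholt", "Aguilar", "Allen"],
        "state": ["NC", "AL", "CA", "GA"],
        "party": ["D", "R", "D", "R"],
    }
)

# A list of column names inside the brackets returns a smaller dataframe
names_and_parties = sample[["last_name", "party"]]

# A True/False condition inside the brackets keeps only matching rows
democrats = sample[sample["party"] == "D"]

print(names_and_parties)
print(democrats)
```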

#### Counting values in categorical or string columns

In [33]:
df['party'].value_counts()

D     278
R     268
ID      2
Name: party, dtype: int64
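
Raw counts are often easier to interpret as shares. Passing `normalize=True` to `value_counts()` returns proportions instead. The sketch below rebuilds a party column with the same counts as the output above; the `party` Series is a stand-in for `df['party']`.

```python
import pandas as pd

# Recreate the party column from the counts shown above
party = pd.Series(["D"] * 278 + ["R"] * 268 + ["ID"] * 2)

# normalize=True returns each value's share of the total
shares = party.value_counts(normalize=True)

# Multiply by 100 and round for readable percentages
print((shares * 100).round(1))
```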

---

## Interacting with the data

#### Use the `head()` method to see the first *n* rows

In [34]:
df.head(10)

| | id | seniority | full_name | apstate | first_name | last_name | chamber | session | state | party | dw_nominate | votes_with_party_pct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | A000370 | 10 | Alma Adams | N.C. | Alma | Adams | house | 117 | NC | D | -0.465 | 98.65 |
| 1 | A000055 | 26 | Robert Aderholt | Ala. | Robert | Aderholt | house | 117 | AL | R | 0.376 | 97.02 |
| 2 | A000371 | 8 | Pete Aguilar | Calif. | Pete | Aguilar | house | 117 | CA | D | -0.294 | 98.43 |
| 3 | A000372 | 8 | Rick Allen | Ga. | Rick | Allen | house | 117 | GA | R | 0.696 | 93.12 |
| 4 | A000376 | 4 | Colin Allred | Texas | Colin | Allred | house | 117 | TX | D | NaN | 97.54 |
| 5 | A000369 | 12 | Mark Amodei | Nev. | Mark | Amodei | house | 117 | NV | R | 0.38 | 95.43 |
| 6 | A000377 | 4 | Kelly Armstrong | N.D. | Kelly | Armstrong | house | 117 | ND | R | NaN | 95.19 |
| 7 | A000375 | 6 | Jodey Arrington | Texas | Jodey | Arrington | house | 117 | TX | R | 0.648 | 89.31 |
| 8 | A000148 | 2 | Jake Auchincloss | Mass. | Jake | Auchincloss | house | 117 | MA | D | NaN | 98.65 |
| 9 | A000378 | 4 | Cynthia Axne | Iowa | Cynthia | Axne | house | 117 | IA | D | NaN | 96.87 |


#### Use the `tail()` method to see the last *n* rows

In [35]:
df.tail()

| | id | seniority | full_name | apstate | first_name | last_name | chamber | session | state | party | dw_nominate | votes_with_party_pct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 543 | W000817 | 9 | Elizabeth Warren | Mass. | Elizabeth | Warren | senate | 117 | MA | D | -0.759 | 97.91 |
| 544 | W000802 | 15 | Sheldon Whitehouse | R.I. | Sheldon | Whitehouse | senate | 117 | RI | D | -0.355 | 99.05 |
| 545 | W000437 | 15 | Roger Wicker | Miss. | Roger | Wicker | senate | 117 | MS | R | 0.378 | 91.2 |
| 546 | W000779 | 26 | Ron Wyden | Ore. | Ron | Wyden | senate | 117 | OR | D | -0.33 | 98.87 |
| 547 | Y000064 | 5 | Todd Young | Ind. | Todd | Young | senate | 117 | IN | R | 0.465 | 90.33 |


#### Sorting with the `sort_values()` method to find the member with the most seniority

In [36]:
df.sort_values('seniority')

| | id | seniority | full_name | apstate | first_name | last_name | chamber | session | state | party | dw_nominate | votes_with_party_pct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 483 | H001089 | 1 | Joshua Hawley | Mo. | Joshua | Hawley | senate | 117 | MO | R | NaN | 86.37 |
| 454 | B001310 | 1 | Mike Braun | Ind. | Mike | Braun | senate | 117 | IN | R | NaN | 90.29 |
| 504 | M001198 | 1 | Roger Marshall | Kan. | Roger | Marshall | senate | 117 | KS | R | 0.560 | 92.21 |
| 492 | K000377 | 1 | Mark Kelly | Ariz. | Mark | Kelly | senate | 117 | AZ | D | NaN | 97.73 |
| 493 | K000393 | 1 | John Kennedy | La. | John | Kennedy | senate | 117 | LA | R | 0.593 | 91.72 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188 | H000874 | 42 | Steny Hoyer | Md. | Steny | Hoyer | house | 117 | MD | D | -0.380 | 98.41 |
| 340 | R000395 | 42 | Harold Rogers | Ky. | Harold | Rogers | house | 117 | KY | R | 0.338 | 97.03 |
| 378 | S000522 | 42 | Christopher Smith | N.J. | Christopher | Smith | house | 117 | NJ | R | 0.167 | 88.51 |
| 497 | L000174 | 47 | Patrick Leahy | Vt. | Patrick | Leahy | senate | 117 | VT | D | -0.360 | 99.62 |
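
Because `sort_values()` sorts in ascending order by default, the most senior members land at the bottom of the result. A sketch of one way to put them on top is to pass `ascending=False`. The four-row `sample` dataframe below is illustrative, built from rows in this dataset.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "full_name": ["Joshua Hawley", "Elizabeth Warren", "Steny Hoyer", "Patrick Leahy"],
        "seniority": [1, 9, 42, 47],
    }
)

# ascending=False puts the largest values first
most_senior = sample.sort_values("seniority", ascending=False)

print(most_senior.head(2))
```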


#### Sorting with the `sort_values()` method to find the most liberal or most conservative member

In [39]:
df.sort_values('dw_nominate', ascending=True).head()

| | id | seniority | full_name | apstate | first_name | last_name | chamber | session | state | party | dw_nominate | votes_with_party_pct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 543 | W000817 | 9 | Elizabeth Warren | Mass. | Elizabeth | Warren | senate | 117 | MA | D | -0.759 | 97.91 |
| 481 | H001075 | 3 | Kamala Harris | Calif. | Kamala | Harris | senate | 117 | CA | D | -0.709 | 100.0 |
| 197 | J000298 | 6 | Pramila Jayapal | Wash. | Pramila | Jayapal | house | 117 | WA | D | -0.691 | 97.52 |
| 238 | L000551 | 26 | Barbara Lee | Calif. | Barbara | Lee | house | 117 | CA | D | -0.681 | 97.96 |
| 427 | W000187 | 32 | Maxine Waters | Calif. | Maxine | Waters | house | 117 | CA | D | -0.655 | 95.12 |
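
In this column, lower (more negative) scores indicate more liberal voting records and higher scores more conservative ones, so an ascending sort surfaces the most liberal members. Many members have no score at all, so one possible approach for the conservative end is to drop the missing rows first and then sort in descending order. The `sample` dataframe here is illustrative, built from rows shown in this notebook.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "full_name": ["Elizabeth Warren", "Rick Allen", "Colin Allred", "Jodey Arrington"],
        "dw_nominate": [-0.759, 0.696, None, 0.648],
    }
)

# Drop members without a score, then sort from highest to lowest
ranked = sample.dropna(subset=["dw_nominate"]).sort_values(
    "dw_nominate", ascending=False
)

print(ranked)
```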


---

## What questions would you ask of this dataset? 

In [None]:
df.groupby(['state'])['id'].count()
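
The cell above counts members per state. The same `groupby()` pattern can answer other questions, such as how often each party's members vote with their party on average. This is a sketch on a four-row illustrative `sample` built from rows in this dataset, not a result for the full data.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "last_name": ["Adams", "Aguilar", "Aderholt", "Allen"],
        "party": ["D", "D", "R", "R"],
        "votes_with_party_pct": [98.65, 98.43, 97.02, 93.12],
    }
)

# size() counts the rows in each group
party_sizes = sample.groupby("party").size()

# Select a column, then aggregate it within each group
party_loyalty = sample.groupby("party")["votes_with_party_pct"].mean()

print(party_sizes)
print(party_loyalty)
```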

---

## Export

In [None]:
df.to_csv('../../data/processed/members_of_congress_117.csv', index=False)

In [None]:
df.to_excel('../../data/processed/members_of_congress_117.xlsx')
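
A note on `index=False`: without it, Pandas writes the row index as an extra, unnamed first column, which shows up as `Unnamed: 0` when the file is read back in. The sketch below demonstrates this with an in-memory buffer instead of a real file; the `sample` dataframe and buffer names are illustrative.

```python
import io

import pandas as pd

sample = pd.DataFrame(
    {"last_name": ["Adams", "Aderholt"], "party": ["D", "R"]}
)

# Default behavior: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'last_name', 'party']
print(list(reloaded_clean.columns))       # ['last_name', 'party']
```

The `to_excel()` method accepts the same `index=False` argument if you want to leave the index out of the spreadsheet as well.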