# Pandas Part 2
Link: https://youtu.be/zmdjNSmRXF4

In [1]:
import pandas as pd

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). You can think of it as a table of rows and columns, or as something like a Python dictionary in which each key is a column name and each value is the list of that column's values. A DataFrame is a container for multiple Series objects.

```python
type(people_df)
# Output: pandas.core.frame.DataFrame
```
Columns in pandas are Series objects. A Series is a one-dimensional labeled array: essentially a list of data with an index attached. You can think of a Series as the rows of a single column. Each element in a Series is associated with an index label, which can be used to access that element.

```python
type(people_df['email'])
# Output: pandas.core.series.Series
```
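
To illustrate how each element of a Series is tied to an index, here is a minimal sketch. The data and the custom labels (`'john'`, `'jane'`) are made up for this example and do not appear elsewhere in these notes.

```python
import pandas as pd

# Hypothetical sample data with custom index labels instead of the default 0, 1
emails = pd.Series(
    ['john.doe@example.com', 'jane.smith@example.com'],
    index=['john', 'jane'],
    name='email',
)

by_label = emails['jane']      # look up an element by its index label
by_position = emails.iloc[0]   # look up an element by its integer position
```

With the default integer index, label and position happen to coincide, which is why the difference is easy to miss.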



In [2]:
people = {
    'first': ['John', 'Jane', 'Jim'],
    'last': ['Doe', 'Smith', 'Brown'],
    'email': ['john.doe@example.com', 'jane.smith@example.com', 'jim.brown@example.com']
}

In [3]:
people_df = pd.DataFrame(people)
people_df

,first,last,email
0,John,Doe,john.doe@example.com
1,Jane,Smith,jane.smith@example.com
2,Jim,Brown,jim.brown@example.com


In [4]:
type(people_df)

pandas.core.frame.DataFrame

In [5]:
type(people_df['email'])

pandas.core.series.Series

In [6]:
# Accessing a column
people_df['email']

0      john.doe@example.com
1    jane.smith@example.com
2     jim.brown@example.com
Name: email, dtype: object

In [7]:
people_df.email  # Another way of accessing a column (not recommended)

0      john.doe@example.com
1    jane.smith@example.com
2     jim.brown@example.com
Name: email, dtype: object
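
One likely reason dot notation is discouraged is that it breaks when a column name clashes with an existing DataFrame attribute or is not a valid Python identifier. The sketch below uses a made-up frame (`clash_df` and its column names are hypothetical) to show both cases.

```python
import pandas as pd

# Hypothetical frame: 'count' clashes with DataFrame.count, 'first name' contains a space
clash_df = pd.DataFrame({'count': [3, 5], 'first name': ['John', 'Jane']})

bracket_result = clash_df['count']   # the column, returned as a Series
dot_result = clash_df.count          # the DataFrame.count method, not the column
spaced = clash_df['first name']      # only reachable with bracket notation
```

Bracket notation works for every column name, so it is the safer habit.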

In [8]:
# Accessing multiple columns: pass a list of column names
people_df[['email','last']]  # This is a DataFrame, not a Series

,email,last
0,john.doe@example.com,Doe
1,jane.smith@example.com,Smith
2,jim.brown@example.com,Brown
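
The return type depends on the brackets, not on how many columns are selected. As a small sketch that rebuilds the same `people` data, a single column name gives a Series, while a list with one name gives a one-column DataFrame.

```python
import pandas as pd

people_df = pd.DataFrame({
    'first': ['John', 'Jane', 'Jim'],
    'last': ['Doe', 'Smith', 'Brown'],
    'email': ['john.doe@example.com', 'jane.smith@example.com', 'jim.brown@example.com'],
})

as_series = people_df['email']     # single label -> Series
as_frame = people_df[['email']]    # list of labels -> DataFrame, even with one column

series_shape = as_series.shape     # one dimension: (3,)
frame_shape = as_frame.shape       # two dimensions: (3, 1)
```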


In [9]:
# List of all columns
people_df.columns

Index(['first', 'last', 'email'], dtype='object')

In [10]:
# Accessing rows
# iloc selects rows (and columns) by integer position
# The first indexer selects rows, the second selects columns
people_df.iloc[[0,1],[0,1]]

,first,last
0,John,Doe
1,Jane,Smith


In [11]:
# loc selects rows (and columns) by label
people_df.loc[[0,1],['email','last']]

,email,last
0,john.doe@example.com,Doe
1,jane.smith@example.com,Smith
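
With the default integer index, `loc` and `iloc` look interchangeable because the labels equal the positions. The sketch below assumes we move `email` into the index with `set_index` (a step not taken in these notes) so that labels and positions differ.

```python
import pandas as pd

people_df = pd.DataFrame({
    'first': ['John', 'Jane', 'Jim'],
    'last': ['Doe', 'Smith', 'Brown'],
    'email': ['john.doe@example.com', 'jane.smith@example.com', 'jim.brown@example.com'],
})

# Assumption for illustration: use the email column as the row labels
indexed_df = people_df.set_index('email')

# loc takes labels: the row labelled with Jane's email, the column named 'last'
by_label = indexed_df.loc['jane.smith@example.com', 'last']

# iloc takes positions: second row, second remaining column ('first' is 0, 'last' is 1)
by_position = indexed_df.iloc[1, 1]
```

Both lookups return the same cell; only the way the cell is addressed changes.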


In [12]:
path_df = 'data/survey_results_public.csv'
path_schema = 'data/survey_results_schema.csv'

# Display options: control how many rows and columns are printed
pd.set_option('display.max_columns', 10)  # Set this to the total number of columns (85 here) to show them all
pd.set_option('display.max_rows', 10)

df = pd.read_csv(path_df)
schema_df = pd.read_csv(path_schema)
df.head()

,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,...,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,...,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,...,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,...,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,...,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
4,5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",...,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy
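
`pd.set_option` changes the display settings for the rest of the session. If a wider view is only needed once, `pd.option_context` applies the settings temporarily. The values below (`None` for no column limit, 20 rows) are arbitrary choices for this sketch.

```python
import pandas as pd

before = pd.get_option('display.max_columns')

# Settings apply only inside the with-block; None means "no limit" for max_columns
with pd.option_context('display.max_columns', None, 'display.max_rows', 20):
    inside_cols = pd.get_option('display.max_columns')
    inside_rows = pd.get_option('display.max_rows')

# Outside the block the previous setting is restored
after = pd.get_option('display.max_columns')
```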


In [13]:
df.shape

(88883, 85)

In [14]:
# Grab all of the responses for the Hobbyist column
df['Hobbyist']

0        Yes
1         No
2        Yes
3         No
4        Yes
        ... 
88878    Yes
88879     No
88880     No
88881     No
88882    Yes
Name: Hobbyist, Length: 88883, dtype: object

In [15]:
# Count how often each value occurs in a specific column
df[['Hobbyist']].value_counts()

Hobbyist
Yes         71257
No          17626
Name: count, dtype: int64
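
`value_counts` can also report proportions instead of raw counts through its `normalize` parameter. The sketch below uses a tiny made-up Series rather than the survey data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical answers, not taken from the survey file
answers = pd.Series(['Yes', 'No', 'Yes', 'Yes'], name='Hobbyist')

counts = answers.value_counts()                # absolute frequencies
shares = answers.value_counts(normalize=True)  # relative frequencies that sum to 1

yes_count = counts['Yes']
yes_share = shares['Yes']
```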

In [16]:
# Grab a specific cell by row and column position (check whether the first respondent is a hobbyist)
df.iloc[0, 2]

'Yes'

In [17]:
# Check the first six respondents' answers (labels 0 through 5) from Hobbyist to Employment
df.loc[0:5, 'Hobbyist':'Employment']  # loc slicing includes the end label, for rows and columns alike

,Hobbyist,OpenSourcer,OpenSource,Employment
0,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work"
1,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work"
2,Yes,Never,The quality of OSS and closed source software ...,Employed full-time
3,No,Never,The quality of OSS and closed source software ...,Employed full-time
4,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time
5,Yes,Never,The quality of OSS and closed source software ...,Employed full-time
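
The inclusive end applies to label-based slicing with `loc` only. Position-based slicing with `iloc` follows the usual Python rule and excludes the end. A minimal sketch on a made-up frame (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..9
toy_df = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

label_slice = toy_df.loc[0:5]      # labels 0 through 5, end included
position_slice = toy_df.iloc[0:5]  # positions 0 through 4, end excluded

label_rows = len(label_slice)
position_rows = len(position_slice)
```

Here `label_rows` is 6 and `position_rows` is 5, even though both slices are written as `0:5`.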
