# PyTutorial 3.1 - Data Frame and Series Basics - Selecting Rows and Columns

This module continues the Pandas tutorial that we began in the previous module and builds on what was shown there.

Here we will be learning more about data frames and series data types.
 

In [2]:
# Let's return to where we were in the previous module
import pandas as pd
df = pd.read_csv('/Users/physics14/Desktop/PythonProjects/data/survey_results_public.csv')
schema_df = pd.read_csv('/Users/physics14/Desktop/PythonProjects/data/survey_results_schema.csv')

In [3]:
# Now let's look at our data frame
df.head()

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
4,5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy


As you can see, our data frame is made up of multiple rows and columns. Most data exists in tables, which have rows and columns just like these survey results.

In [4]:
# If we want to know what one of the columns means,
# we can use the schema data frame, which shows the corresponding survey question for each column
schema_df

Unnamed: 0,Column,QuestionText
0,Respondent,Randomized respondent ID number (not in order ...
1,MainBranch,Which of the following options best describes ...
2,Hobbyist,Do you code as a hobby?
3,OpenSourcer,How often do you contribute to open source?
4,OpenSource,How do you feel about the quality of open sour...
...,...,...
80,Sexuality,Which of the following do you currently identi...
81,Ethnicity,Which of the following do you identify as? Ple...
82,Dependents,"Do you have any dependents (e.g., children, el..."
83,SurveyLength,How do you feel about the length of the survey...


To better understand data frames we can look at dictionaries.

In [9]:
# Here is an example of a dictionary
people = {
    "first": ["Corey", "Jane", "John"],
    "last": ["Schafer", "Doe", "Doe"],
    "email": ["CoreySchafer@gmail.com", "JaneDoe@email.com", "JohnDoe@email.com"]
}
# Using a dictionary, we can get specific data:
people["email"]

['CoreySchafer@gmail.com', 'JaneDoe@email.com', 'JohnDoe@email.com']

In [10]:
# Now, we can create a data frame from this dictionary
df = pd.DataFrame(people)
df

Unnamed: 0,first,last,email
0,Corey,Schafer,CoreySchafer@gmail.com
1,Jane,Doe,JaneDoe@email.com
2,John,Doe,JohnDoe@email.com


Now we have rows and columns that we can visualize from the dictionary we created.

The 0, 1, and 2 on the left are the index, which you'll learn about later.

In [11]:
# We can access a single column of this data frame just like we could with the dictionary
df["email"]

0    CoreySchafer@gmail.com
1         JaneDoe@email.com
2         JohnDoe@email.com
Name: email, dtype: object

Now you can see how data frames are similar to dictionaries of lists. However, even though we used a pure Python example (a dictionary) to explain data frames, the two are very different: data frames are much more powerful tools for data analysis.

In [13]:
# One difference between the two examples shown:
# a column of a data frame is a pandas "Series", not a plain list
type(df["email"])

pandas.core.series.Series

A Series is a one-dimensional array with an index. In other words, it is a single column of data, one value per row. In our data frame, each column is a Series (one-dimensional), while the entire data frame is made of rows and columns (two-dimensional).
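The difference in dimensions can be checked directly. The sketch below rebuilds the small example data frame from above so that it runs on its own, then inspects the `ndim` and `shape` attributes of one column and of the whole frame. The variable names (`email`, `series_ndim`, and so on) are only illustrative.

```python
import pandas as pd

# Same small example data as in the cells above
people = {
    "first": ["Corey", "Jane", "John"],
    "last": ["Schafer", "Doe", "Doe"],
    "email": ["CoreySchafer@gmail.com", "JaneDoe@email.com", "JohnDoe@email.com"],
}
df = pd.DataFrame(people)

email = df["email"]

# A Series has one dimension: its shape holds only the number of rows
series_ndim = email.ndim
series_shape = email.shape

# A DataFrame has two dimensions: (number of rows, number of columns)
frame_ndim = df.ndim
frame_shape = df.shape

print(series_ndim, series_shape)
print(frame_ndim, frame_shape)
```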

In [15]:
# Another way to access a single column
df.email
# Access multiple columns
df[["last", "email"]] 

Unnamed: 0,last,email
0,Schafer,CoreySchafer@gmail.com
1,Doe,JaneDoe@email.com
2,Doe,JohnDoe@email.com
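Note that the type of the result depends on the brackets used. As a minimal sketch, again rebuilding the example frame so the snippet is self-contained, a single label returns a Series, while a list of labels returns a data frame, even when the list holds only one name. The names `single`, `one_column`, and `two_columns` are placeholders for this illustration.

```python
import pandas as pd

people = {
    "first": ["Corey", "Jane", "John"],
    "last": ["Schafer", "Doe", "Doe"],
    "email": ["CoreySchafer@gmail.com", "JaneDoe@email.com", "JohnDoe@email.com"],
}
df = pd.DataFrame(people)

# A single label gives back one column as a Series
single = df["email"]

# A list of labels gives back a DataFrame, even with just one label in the list
one_column = df[["email"]]
two_columns = df[["last", "email"]]

print(type(single).__name__)
print(type(one_column).__name__)
print(list(two_columns.columns))
```

Bracket access is usually the safer choice over dot access (`df.email`), because a column whose name matches a data frame method or attribute, such as `count`, cannot be reached with a dot.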


In [17]:
# To see the names of the columns, use ".columns"
df.columns

Index(['first', 'last', 'email'], dtype='object')

In [None]:
# To access a row by its integer position, use ".iloc"
# For example, the first row
df.iloc[0]

first                     Corey
last                    Schafer
email    CoreySchafer@gmail.com
Name: 0, dtype: object

In [18]:
# To see multiple rows, use a list
df.iloc[[0, 1]]

Unnamed: 0,first,last,email
0,Corey,Schafer,CoreySchafer@gmail.com
1,Jane,Doe,JaneDoe@email.com


In [19]:
# We can also choose a specific column by passing its integer position as the second argument
# (position 2 is the "email" column)
df.iloc[[0, 1], 2]

0    CoreySchafer@gmail.com
1         JaneDoe@email.com
Name: email, dtype: object

So far we've been using "iloc", which selects by integer location. We can also use "loc", which selects by label.
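As a rough sketch of the difference, the snippet below repeats the last `iloc` selection using `loc` and column labels instead. It rebuilds the example frame so that it runs on its own, and the variable names are only illustrative. Because this frame uses the default index, the row labels 0, 1, and 2 happen to coincide with the row positions.

```python
import pandas as pd

people = {
    "first": ["Corey", "Jane", "John"],
    "last": ["Schafer", "Doe", "Doe"],
    "email": ["CoreySchafer@gmail.com", "JaneDoe@email.com", "JohnDoe@email.com"],
}
df = pd.DataFrame(people)

# iloc: rows 0 and 1, column at integer position 2
by_position = df.iloc[[0, 1], 2]

# loc: rows labeled 0 and 1, column labeled "email"
by_label = df.loc[[0, 1], "email"]

# Both select the same values
print(by_position.equals(by_label))

# loc also accepts a list of column labels, returned in the order given
subset = df.loc[[0, 1], ["email", "last"]]
print(subset)

# Slices behave differently: loc includes the end label, iloc excludes the end position
inclusive_rows = df.loc[0:1]
exclusive_rows = df.iloc[0:1]
print(len(inclusive_rows), len(exclusive_rows))
```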