# PyTutorial 3.0: Introduction to Pandas - Installation and Loading Data

This part of the tutorial focuses on the Pandas library for data manipulation and analysis.
The name "Pandas" is derived from the term "panel data" and is also a play on the phrase "Python data analysis".

Here are some highlights of what Pandas can do:

- A fast and efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Size mutability: columns can be inserted into and deleted from data structures;
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time-series functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging, plus domain-specific time offsets and joining of time series without losing data;
- Highly optimized for performance, with critical code paths written in Cython or C.
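
As a small illustration of the label-based alignment and missing-data handling mentioned in the list above, the following sketch adds two Series whose indexes only partly overlap. The Series names and values are made up for this example and are not part of the survey data.

```python
import pandas as pd

# Two hypothetical Series with partly overlapping labels.
a = pd.Series([1.0, 2.0], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Addition aligns on labels, not positions; a label present in only one
# Series produces a missing value (NaN) in the result.
total = a + b
print(total)

# Missing values can then be handled explicitly, for example by dropping them.
print(total.dropna())
```

Only the shared label `y` receives a numeric result; the other two labels are kept but marked as missing.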

In this part, you will learn how to import Pandas and load some example data to gain a basic understanding of DataFrames.
To get started, you will need to install Pandas using pip:
- From a terminal, type: `pip install pandas`
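
A quick way to confirm that the installation worked is to import the library and print its version. This is only a sanity check; the version printed depends on your environment.

```python
import pandas as pd

# If this import succeeds, pandas is installed for the current interpreter.
print("pandas version:", pd.__version__)
```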

We will also make use of public data from Stack Overflow's annual software developer survey (http://bit.ly/SO-Survey-Download).
Results from a recent survey have been downloaded to the `Data` folder in the same directory as this file.

In [52]:
# First import os and pandas:
import os
import pandas as pd

# Get the current working directory (assumed here to be the folder containing this file):
cwd = os.getcwd()
# Define the absolute path for the survey results csv file:
file_path = os.path.join(cwd, 'Data', 'survey_results_public.csv')
# Read csv file into a data frame:
df = pd.read_csv(file_path)
# Display the data frame:
df

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65432,65433,I am a developer by profession,18-24 years old,"Employed, full-time",Remote,Apples,Hobby;School or academic work,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","On the job training;School (i.e., University, ...",,...,,,,,,,,,,
65433,65434,I am a developer by profession,25-34 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects,,,,...,,,,,,,,,,
65434,65435,I am a developer by profession,25-34 years old,"Employed, full-time",In-person,Apples,Hobby,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Other online resources (e.g., videos, blogs, f...",Technical documentation;Stack Overflow;Social ...,...,,,,,,,,,,
65435,65436,I am a developer by profession,18-24 years old,"Employed, full-time","Hybrid (some remote, some in-person)",Apples,Hobby;Contribute to open-source projects;Profe...,"Secondary school (e.g. American high school, G...",On the job training;Other online resources (e....,Technical documentation;Blogs;Written Tutorial...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
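
If the survey file is not at hand, the same call can be tried on a few lines of in-memory text. The sketch below assumes a made-up three-row CSV with column names borrowed from the survey; `usecols` and `nrows` are optional `read_csv` parameters that restrict what is loaded.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for the survey file.
csv_text = """ResponseId,MainBranch,Age
1,I am a developer by profession,Under 18 years old
2,I am a developer by profession,35-44 years old
3,I am learning to code,18-24 years old
"""

# read_csv accepts a file path or any file-like object.
small = pd.read_csv(io.StringIO(csv_text))

# usecols and nrows limit the columns and rows that are read,
# which can help when exploring a large file.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["ResponseId", "Age"], nrows=2)

print(small)
print(subset)
```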


In [53]:
# As you can see, the data frame contains rows and columns of data like a spreadsheet, but not all of the data contained in the csv file are displayed.
# By default, a data frame with more than 60 rows is displayed in truncated form: only the first 5 rows (the "head") and the last 5 rows (the "tail") are shown. Similarly, in a notebook a data frame with more than 20 columns shows only the first 10 and the last 10 columns.

# The "shape" attribute returns the actual size of the data frame as a (rows, columns) tuple:
print('df.shape:', df.shape)
nrows, ncols = df.shape

# We can also use the "info" method for more information (size and data types)
print('df.info():')
df.info()

df.shape: (65437, 114)
df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65437 entries, 0 to 65436
Columns: 114 entries, ResponseId to JobSat
dtypes: float64(13), int64(1), object(100)
memory usage: 56.9+ MB
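
The same inspection steps can be tried on a tiny frame where the result is easy to check by eye. This is a minimal sketch with made-up values; the `buf` argument of `info` is used here only to capture the text that would otherwise be printed.

```python
import io
import pandas as pd

# Hypothetical small frame with one integer, one float and one text column.
demo = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "JobSat": [7.0, None, 8.0],
    "Age": ["18-24 years old", "25-34 years old", "35-44 years old"],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
nrows, ncols = demo.shape
print(nrows, ncols)

# dtypes lists the data type of every column.
print(demo.dtypes)

# info() prints to the screen by default; buf redirects the text instead.
buffer = io.StringIO()
demo.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```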


In [54]:
# To change the default options, such as the maximum number of rows or columns displayed:
pd.set_option('display.min_rows', None) ## None makes 'min_rows' follow 'max_rows'; otherwise a truncated view shows only 10 rows
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 8)
df
# Note that the first column displayed is the index, an automatically assigned row identifier that is not part of the data set.

Unnamed: 0,ResponseId,MainBranch,Age,Employment,...,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",...,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",...,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",...,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",...,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",...,Too short,Easy,,
5,6,I code primarily as a hobby,Under 18 years old,"Student, full-time",...,Appropriate in length,Easy,,
6,7,"I am not primarily a developer, but I write co...",35-44 years old,"Employed, full-time",...,Too long,Neither easy nor difficult,,
7,8,I am learning to code,18-24 years old,"Student, full-time;Not employed, but looking f...",...,Appropriate in length,Difficult,,
8,9,I code primarily as a hobby,45-54 years old,"Employed, full-time",...,Appropriate in length,Neither easy nor difficult,,
9,10,I am a developer by profession,35-44 years old,"Independent contractor, freelancer, or self-em...",...,Too long,Easy,,
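
To see how `display.min_rows` and `display.max_rows` interact, the following sketch builds a made-up 100-row frame and checks which row labels appear in its printed form under two settings. The helper `shown_labels` is a hypothetical function written for this illustration, not part of pandas.

```python
import pandas as pd

# Hypothetical 100-row frame with easily recognizable row labels.
labels = [f"row{i:03d}" for i in range(100)]
demo = pd.DataFrame({"value": range(100)}, index=labels)

def shown_labels(frame):
    """Return the row labels that appear in the frame's printed text."""
    text = repr(frame)
    return [label for label in frame.index if label in text]

try:
    pd.set_option("display.max_columns", 8)
    pd.set_option("display.max_rows", 20)

    pd.set_option("display.min_rows", 10)    # the pandas default
    with_default_min = shown_labels(demo)    # 5 head rows + 5 tail rows

    pd.set_option("display.min_rows", None)  # follow max_rows instead
    with_min_none = shown_labels(demo)       # 10 head rows + 10 tail rows
finally:
    # Restore the defaults so later cells are not affected.
    for name in ("display.min_rows", "display.max_rows", "display.max_columns"):
        pd.reset_option(name)

print(len(with_default_min), len(with_min_none))
```

With the default `min_rows`, a frame longer than `max_rows` is cut down to 10 visible rows; setting `min_rows` to `None` lets the truncated view use the full `max_rows` allowance.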


In [None]:
# To reset these options:
pd.reset_option('display.min_rows')
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
df
# To reset all options at once, pass 'all' as the argument: pd.reset_option('all')
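
As an alternative to setting and then resetting options by hand, `pd.option_context` applies options only inside a `with` block and restores the previous values afterwards. A minimal sketch, with variable names chosen for this example:

```python
import pandas as pd

before = pd.get_option("display.max_rows")

# The options below apply only while the with-block is running.
with pd.option_context("display.max_rows", 20, "display.max_columns", 8):
    inside = pd.get_option("display.max_rows")

# After the block, the earlier value is back in effect.
after = pd.get_option("display.max_rows")
print(before, inside, after)
```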

In [55]:
# The 'head' method returns only the first 'n' rows (the header row with the column names is displayed but not counted):
df.head(10)
# or the first 5 rows by default:
# df.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,...,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",...,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",...,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",...,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",...,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",...,Too short,Easy,,
5,6,I code primarily as a hobby,Under 18 years old,"Student, full-time",...,Appropriate in length,Easy,,
6,7,"I am not primarily a developer, but I write co...",35-44 years old,"Employed, full-time",...,Too long,Neither easy nor difficult,,
7,8,I am learning to code,18-24 years old,"Student, full-time;Not employed, but looking f...",...,Appropriate in length,Difficult,,
8,9,I code primarily as a hobby,45-54 years old,"Employed, full-time",...,Appropriate in length,Neither easy nor difficult,,
9,10,I am a developer by profession,35-44 years old,"Independent contractor, freelancer, or self-em...",...,Too long,Easy,,
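
The behaviour of `head` for different values of `n` can be checked on a small made-up frame. This sketch assumes a 7-row frame with a single column:

```python
import pandas as pd

# Hypothetical 7-row frame.
demo = pd.DataFrame({"ResponseId": range(1, 8)})

first_three = demo.head(3)     # 3 data rows; the header is not counted
all_rows = demo.head(100)      # asking for more rows than exist returns all 7
but_last_two = demo.head(-2)   # a negative n drops the last 2 rows instead

print(len(first_three), len(all_rows), len(but_last_two))
```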


In [56]:
# Similarly, the 'tail' method returns only the last 'n' rows:
df.tail(10)
# or the last 5 rows by default:
# df.tail()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,...,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
65427,65428,"I am not primarily a developer, but I write co...",25-34 years old,"Not employed, but looking for work;Employed, p...",...,,,,
65428,65429,I am a developer by profession,25-34 years old,"Employed, full-time",...,,,,
65429,65430,I am learning to code,18-24 years old,"Student, part-time",...,,,,
65430,65431,I am learning to code,18-24 years old,"Not employed, but looking for work;Employed, p...",...,,,,
65431,65432,I am a developer by profession,45-54 years old,"Employed, full-time",...,,,,
65432,65433,I am a developer by profession,18-24 years old,"Employed, full-time",...,,,,
65433,65434,I am a developer by profession,25-34 years old,"Employed, full-time",...,,,,
65434,65435,I am a developer by profession,25-34 years old,"Employed, full-time",...,,,,
65435,65436,I am a developer by profession,18-24 years old,"Employed, full-time",...,,,,
65436,65437,I code primarily as a hobby,18-24 years old,"Student, full-time",...,,,,
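
For comparison, `tail(n)` selects the same rows as positional slicing with `iloc[-n:]`, and the original index labels are kept. A minimal sketch on a made-up 7-row frame:

```python
import pandas as pd

# Hypothetical 7-row frame.
demo = pd.DataFrame({"ResponseId": range(1, 8)})

last_three = demo.tail(3)
same_rows = demo.iloc[-3:]   # positional slice selecting the last 3 rows

print(last_three.equals(same_rows))
print(list(last_three.index))
```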
