# 04 Pandas DataFrame Basics
File(s) needed: Applewood_2011.csv

<p style="color:blue;padding-left:20px;font-size:200%;font-family: Segoe Script, cursive">First, some background ...</p>

## Define data science
Data science is a term whose definition is difficult to nail down. Some people say data scientist is “the sexiest job of the 21st century” (Harvard Business Review, October 2012), while others say it’s just a bunch of hype. We will take a position somewhere between those two views.
- Data science is an interdisciplinary field that combines expertise in statistics, computer science, and the field providing the problem context. 

The qualifications for data scientists are still evolving. However, many of them have advanced degrees in statistics and/or computer science, plus years of experience in a practical field (like marketing). They need this advanced training because they are the people who develop new methodologies and tools for solving problems in the field.


## Define business analytics
Business analytics (BA) focuses more on applying data science tools to specific business problems. The required training and expertise are also evolving, but they are not as extensive as those for a data scientist, and they are usually heavier on domain expertise. That is where the “business” part comes in.


## Why are we talking about these topics in a Python class?
The answer to that question is simple: analytics (and data science) rely heavily on Python to get things done. Data analytics is one of the main settings in which you are likely to use Python in the future. This class is built around that assumption.


## Life is a series of word problems
If you didn’t like doing word problems in math class when you were younger, I have some bad news. 

**_Virtually everything you do in your life is about solving problems, and they are almost always word problems._**

BA is no exception. So we need a process to follow. Let’s call it, oh I don’t know, a “problem-solving process.” Sound familiar? I hope so!
1. Understand the problem
2. Identify the root cause of the problem
3. Develop an effective solution
4. Execute the solution until the problem is solved (with modifications as necessary)

For the remainder of the semester, we are going to focus on some of the things we can do with Python in those first two steps. Then we will take a shot at step three. 


## A more detailed process model
The general problem-solving process is a good start, but there are better, more detailed models we can apply to BA. One is the software development life cycle (SDLC). It is intended for software production but could easily be modified for BA. There is another model that you may not have heard of that works well for these kinds of problems.

## CRISP-DM model

The **CR**oss **I**ndustry **S**tandard **P**rocess for **D**ata **M**ining is the most commonly used methodology for analytics or data mining projects. See this link for one graphic: http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html

![CRISP-DM_1.png](attachment:CRISP-DM_1.png)

![CRISP-DM_2.png](attachment:CRISP-DM_2.png)

Here are two graphical representations of the CRISP-DM process. Don’t get caught up in the details of the model. You won’t be held accountable for them here. But the model will help us organize our efforts when trying to solve a problem with data analysis. For one thing it shows how all of the pieces go together.

If we look at the steps listed out we can see where we are most likely to make a contribution with our Python skills.

1. Business Understanding
2. **Data Understanding**
3. **Data Preparation**
4. _Model Building_
5. Testing and Evaluation
6. Deployment

Most of what we can do is in steps 2 and 3. We can learn more about the data we have and then get it ready for analysis. We will also do a simple model (step 4) in a couple of weeks. For right now, doing a good job preparing data is important because that step alone can consume as much as 80% of your total project time!

### Data Understanding
Together with data preparation, this step can eat up as much as 80% of the overall time it takes to work through an analytics problem. Why? Because data in the wild is never perfect. 

Understand everything you can about the data you will use.
- Where does it come from?
- How was it collected? Stored?
- What format is it in?
- How do you access it?
- Is it the right data? Do you need something else?
- Are there any problems with it? What do you need to do about them?

These issues will be the focus of our next few classes.


### Data Preparation
Get the data “in shape” for mining.
- Combine data sets as necessary.
- Exclude unnecessary variables.
- Use a subset of the data.
- Create new variables when necessary.
- Scrub/clean data to handle errors, outliers, or missing data.
- Make sure all data is formatted consistently.

Our data cleaning activities can be characterized as conditioning and shaping the data. One way we can start to learn something about the data is by visualizing it. There are many great tools for doing so but Python has some tools built into it that allow you to get an idea of what is going on in the data. We will do that in a couple of weeks.


### Model Building
A model is an incomplete, artificial version of reality. **_It is not reality._**
- “All models are wrong but some are useful.” – statistician George Box

We want to go _immediately_ to this step, but that is a painful mistake. That is the analytics-themed equivalent of writing code before you solve the problem.

As previously mentioned, we will come back to this topic later.


# Using pandas
## pandas is a Python library
Now we begin to see one of the reasons why Python is so popular: **libraries**. Libraries are self-contained packages of coded capabilities that can be imported into Python to extend its functionality. One of the most fundamental libraries for working with data is called pandas.


### What is pandas?
>"pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python."  http://pandas.pydata.org/pandas-docs/stable/
    
The main way pandas does this is through the implementation of two data structures: Series and DataFrame. We will learn more about them soon but for now let's load some data and see some of the things we can do.

#### Why use a library?
Have you ever seen a data file where the first row contains column headers? Of course you have! That is way more common than just having rows of data. By using the correct _parser_ in pandas, we can automatically read the headers. Of course, we can program that ourselves in base Python, but why would we want to? Someone else already did a good job of doing it so let's use their code.

That same idea applies to many other computing tasks, too. If the functionality is already available in a library, we can put it to work instead of having to write the code ourselves.
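
As a rough illustration of that trade-off, the sketch below parses the same small piece of CSV text twice: once with Python's built-in `csv` module and once with `pd.read_csv`. The three data rows are copied from the first rows of the Applewood data shown later in this notebook. The `StringIO` object is only a stand-in for a file on disk, and the names `csv_text`, `ages`, and `small_df` are made up for this example.

```python
import csv
import io

import pandas as pd

# A few rows in CSV form (stand-in for a file on disk).
csv_text = """Age,Profit,Location
21,1387,Tionesta
23,1754,Sheffield
24,1817,Sheffield
"""

# Base Python: we split off the header ourselves, and every value is a string
# until we convert it.
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]
ages = [int(record[header.index("Age")]) for record in records]

# pandas: the header row and the numeric column types are detected for us.
small_df = pd.read_csv(io.StringIO(csv_text))

print(ages)
print(small_df["Age"].tolist())
```

Both approaches arrive at the same ages, but the pandas version needs one line and also gives us typed, labeled columns to work with.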

### Loading Data

- Use the `import` keyword to `import` the `pandas` library.
- For many libraries, it's typical to 'alias' them with a shorter name for convenience. With pandas we use the alias `pd`.
- Pandas has a `read_csv` function you can use to load text files.

In one of our first cells we add the following import statement. Let's add it now.
```
import pandas as pd
```

In [1]:
# We will use this syntax.
# You will see that this way makes it easier to access the functions.
# You will also see online that this is the way most people do it.
import pandas as pd


In [2]:
# The data has to be stored somewhere when it is read. That means we need
# a location name on the left side of an assignment operator.
df = pd.read_csv("../data/Applewood_2011.csv")

Look at the `read_csv` notation

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
- `pd`: calls the `pandas` library using the alias we specified
- `.read_csv`: use a period to "look into" the library for the `read_csv` function
- `'../data/Applewood_2011.csv'`: location of the file we want to use
    - `../`: look one directory up
    - `data/`: then look in the data subfolder
    - `Applewood_2011.csv`: use the file named `Applewood_2011.csv`

`read_csv` defaults to expecting comma-separated values in the file. Not surprising, since that's what `csv` stands for. However, it will work with other text files using other delimiters like a semicolon (;), a tab character, or something else. We can add the argument `delimiter='\t'` or `delimiter=';'` to the `read_csv` call to handle other delimiters.

CSV files are just text files. They can be opened in a text editor, the Notebook home folder, or Excel so you can see how they are constructed.
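
Here is a minimal sketch of the `delimiter` argument in action. It assumes two tiny in-memory "files" (built with `io.StringIO`) that hold the same two rows, one separated by semicolons and one by tabs. The variable names are hypothetical.

```python
import io

import pandas as pd

# The same two rows, written with two different delimiters.
semicolon_text = "Age;Profit;Location\n21;1387;Tionesta\n23;1754;Sheffield\n"
tab_text = "Age\tProfit\tLocation\n21\t1387\tTionesta\n23\t1754\tSheffield\n"

semi_df = pd.read_csv(io.StringIO(semicolon_text), delimiter=";")
tab_df = pd.read_csv(io.StringIO(tab_text), delimiter="\t")

# Without the delimiter argument, pandas looks for commas, finds none,
# and puts each whole line into a single column.
wrong_df = pd.read_csv(io.StringIO(semicolon_text))

print(semi_df.shape, tab_df.shape, wrong_df.shape)
```

If a freshly loaded data frame has only one column with a very long name, a wrong delimiter is a likely cause.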

## Dude, where's my data?
The `df` variable we used to store the result of the previous code cell is actually an object called a data frame. A **_data frame_** (also written as DataFrame) is a data structure that holds two-dimensional data, like a table. Much of the data you see is in table format, which means a data frame is an excellent choice for a storage method. The parsers in pandas default to using data frames when they read data that appears to have multiple rows and columns.

Each row in the data frame has an **_index_**. The index value uniquely identifies the row. You can think of it as a *primary key* field for the table. If the data already includes an index, we can tell pandas which column it is in. If there isn't an index already, pandas will create one. The index value is how we reference specific rows. We'll come back to that soon.
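
The difference between a pandas-created index and an index taken from the data can be sketched as follows. The `ID` column and its values are hypothetical (the Applewood file has no such column). `index_col` is the `read_csv` argument that names the column to use as the index.

```python
import io

import pandas as pd

# Hypothetical data whose first column, "ID", already identifies each row.
csv_text = "ID,Age,Profit\nA17,21,1387\nB04,23,1754\n"

# No index specified: pandas creates the labels 0, 1, ...
default_df = pd.read_csv(io.StringIO(csv_text))

# Tell pandas that the "ID" column holds the index.
keyed_df = pd.read_csv(io.StringIO(csv_text), index_col="ID")

print(default_df.index.tolist())
print(keyed_df.index.tolist())
```

In `keyed_df` the `ID` values become the row labels and no longer appear among the regular columns.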

The data frame has many built-in methods we can access to look at its contents. We'll use some of them to take a closer look at the data we just read into memory.

### We always want to see what our data looks like
We can do that easily with some built-in functionality.
- Look at the first few rows of your data to visually inspect it
- What is the type of data your dataset is stored as?
- How many rows and columns are in your dataset?
- General information about the columns, their types, missing values, and the amount of memory the data uses

We can also use the following commands in combination with `print()` to have more control over the output.

#### Look at the `head`
- by default, `head` shows the first 5 rows of the data
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html

```
df.head()
```

In [3]:
# type and run the command to view the head
df.head()

|  | Age | Profit | Location | Vehicle-Type | Previous |
|---|---|---|---|---|---|
| 0 | 21 | 1387 | Tionesta | Sedan | 0 |
| 1 | 23 | 1754 | Sheffield | SUV | 1 |
| 2 | 24 | 1817 | Sheffield | Hybrid | 1 |
| 3 | 25 | 1040 | Sheffield | Compact | 0 |
| 4 | 26 | 1273 | Kane | Sedan | 1 |


In [4]:
# Now do the same command inside print() to see the difference
print(df.head())

   Age  Profit   Location Vehicle-Type  Previous
0   21    1387   Tionesta        Sedan         0
1   23    1754  Sheffield          SUV         1
2   24    1817  Sheffield       Hybrid         1
3   25    1040  Sheffield      Compact         0
4   26    1273       Kane        Sedan         1


Look at the parts of the `head` method.
- `df`: the name of the data frame we saved our data into. We can name it anything as long as it is within the rules of Python and pandas.
- `.head()`: call the head method
    - Use the period (just like calling `read_csv`)
    - Tell the object named `df` to call the `head` method

The `tail` method works the same way to show the last 5 rows by default. 

Both the `head` and `tail` methods can accept an integer argument to allow you to specify how many rows to display.

In [5]:
# view the last 3 rows of the data frame
df.tail(3)

|  | Age | Profit | Location | Vehicle-Type | Previous |
|---|---|---|---|---|---|
| 177 | 72 | 1640 | Olean | Sedan | 1 |
| 178 | 72 | 1821 | Tionesta | SUV | 1 |
| 179 | 73 | 2487 | Olean | Compact | 4 |


#### Data `type`
- `type` is a function (versus a 'method')
    - We write `type(df)` for a function [instead of df.type() for a method]
    - A function operates on an object. A method is part of an object.
- built-in Python function
- Tells you the "type" of the object you pass it

```
type(df)
```

In [6]:
# What type is the df object?
type(df)

pandas.core.frame.DataFrame

#### See Table 1.1 on page 6 for a summary table of Python and pandas data types.
---

### Count rows and columns
- `shape` is an "attribute"
    - It's not a function or a method, but rather an "attribute" of the dataframe object
    - It does not use any parentheses after it
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shape.html

```
df.shape()  # will cause an error
df.shape    # will give you the number of rows and columns
```
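
To put the three kinds of syntax side by side, here is a small sketch on a hypothetical three-row data frame named `demo_df`, built directly in code so it does not depend on the course data file. It also shows what the error looks like when parentheses are added to an attribute.

```python
import pandas as pd

# Hypothetical stand-in for the course data.
demo_df = pd.DataFrame({"Age": [21, 23, 24], "Profit": [1387, 1754, 1817]})

print(type(demo_df))    # function: the object goes inside the parentheses
print(demo_df.head(2))  # method: called from the object, with parentheses
print(demo_df.shape)    # attribute: read from the object, no parentheses

# Adding parentheses to an attribute tries to "call" the tuple it holds,
# which Python reports as a TypeError.
try:
    demo_df.shape()
except TypeError as err:
    print("TypeError:", err)
```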

In [7]:
# get number of rows and columns
df.shape

(180, 5)

### Column names
- You can use the `columns` attribute to get the names of the columns in the dataframe

```
df.columns
```

In [8]:
# get column names
df.columns

Index(['Age', 'Profit', 'Location', 'Vehicle-Type', 'Previous'], dtype='object')

### Column data types
- Use the `dtypes` attribute to show the type of data stored in each column

```
df.dtypes
```

In [9]:
# get column data types
df.dtypes

Age              int64
Profit           int64
Location        object
Vehicle-Type    object
Previous         int64
dtype: object

<h1><span style="color:red">STOP:</span> Let's make sure we all get this before moving on.</h1>

<p style="font-size:175%;text-align:center">What is the difference between a <em style="color:green">method</em>,  a <em style="color:green">function</em>, and an <em style="color:green">attribute</em>?</p>


---

## What's the big picture?
The `info` method allows us to get all of these pieces of information and more about the data we loaded into our `df` object with one command.
- The type of our data structure
- Number of rows
- Number of columns
- Column names
- Number of non-null (i.e., non-missing) values per column
- Type of data in each column
- How much memory is used by this data
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html

```
df.info()
```

In [10]:
# get more info about the data with one command
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 5 columns):
Age             180 non-null int64
Profit          180 non-null int64
Location        180 non-null object
Vehicle-Type    180 non-null object
Previous        180 non-null int64
dtypes: int64(3), object(2)
memory usage: 7.1+ KB


## Subsetting the Data
We often want to work with a subset of the overall dataset. There are three general ways we can subset:
1. by column(s) (also referred to in data analysis as "attribute reduction")
2. by row(s) (also referred to as "sampling")
3. by both column(s) and row(s)


### Selecting a single column
There are multiple ways to designate a single column, but in keeping with the "minimally sufficient pandas" idea we will only use one.
- use square brackets [] and the column name as a string to specify a column.
    - Yes, it is case sensitive, so if you get an error make sure you have the right name.

```
df['Location']
```

In [11]:
# Extract the Location column to display it
df['Location']

0       Tionesta
1      Sheffield
2      Sheffield
3      Sheffield
4           Kane
5      Sheffield
6           Kane
7           Kane
8       Tionesta
9      Sheffield
10          Kane
11          Kane
12         Olean
13     Sheffield
14      Tionesta
15          Kane
16     Sheffield
17      Tionesta
18          Kane
19     Sheffield
20      Tionesta
21     Sheffield
22      Tionesta
23          Kane
24         Olean
25         Olean
26      Tionesta
27         Olean
28         Olean
29          Kane
         ...    
150    Sheffield
151        Olean
152         Kane
153        Olean
154    Sheffield
155    Sheffield
156    Sheffield
157    Sheffield
158         Kane
159         Kane
160     Tionesta
161    Sheffield
162    Sheffield
163        Olean
164        Olean
165         Kane
166        Olean
167    Sheffield
168        Olean
169        Olean
170         Kane
171         Kane
172    Sheffield
173     Tionesta
174    Sheffield
175         Kane
176        Olean
177        Ole

Let's extract the Location column again, save it in a separate object, and inspect the first 5 rows.
```
location_df = df['Location']
location_df.head()
```

In [12]:
# Extract Location to a new object
location_df = df['Location']
location_df.head()

0     Tionesta
1    Sheffield
2    Sheffield
3    Sheffield
4         Kane
Name: Location, dtype: object

In [13]:
# What data type is the location_df object?
type(location_df)

pandas.core.series.Series

### Selecting multiple columns
Instead of one column name, we provide a list of column names. 
- A **list** in Python is a data structure that contains multiple data elements.
    - If you are not familiar with it, it is similar to an array in other languages.
    - We will spend more time with lists later.
- When we create a list, we use elements separated by commas inside square brackets.
- Since we already use square brackets to subset columns, we will **have two sets of square brackets** in this code.

```
df[['Location','Age','Vehicle-Type']].head()
```

In [14]:
# Display the first five rows of the Location, Age, and Vehicle-Type columns.
df[['Location','Age','Vehicle-Type']].head()

|  | Location | Age | Vehicle-Type |
|---|---|---|---|
| 0 | Tionesta | 21 | Sedan |
| 1 | Sheffield | 23 | SUV |
| 2 | Sheffield | 24 | Hybrid |
| 3 | Sheffield | 25 | Compact |
| 4 | Kane | 26 | Sedan |
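
The single-bracket and double-bracket forms return different kinds of objects, which the following sketch tries to make visible. It uses a hypothetical three-row `demo_df` built in code, with values taken from the first rows of the course data.

```python
import pandas as pd

# Hypothetical stand-in for the course data.
demo_df = pd.DataFrame({
    "Age": [21, 23, 24],
    "Location": ["Tionesta", "Sheffield", "Sheffield"],
    "Vehicle-Type": ["Sedan", "SUV", "Hybrid"],
})

one_col = demo_df["Location"]             # a column name gives a Series
one_col_frame = demo_df[["Location"]]     # a list with one name gives a DataFrame
two_cols = demo_df[["Location", "Age"]]   # columns come back in the order listed

print(type(one_col).__name__, type(one_col_frame).__name__)
print(list(two_cols.columns))
```

Note that the list controls the column order of the result, which is why `Location` can appear before `Age` even though `Age` comes first in the original data.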


### Practice
Extract the Profit and Location columns from the original data frame, save the result in a new object named "sub1_df", and use `print()` to inspect the last 8 rows.


In [15]:
# Exercise: Column extraction
sub1_df=df[['Profit','Location']]
print(sub1_df.tail(8))

     Profit   Location
172    2597  Sheffield
173    2742   Tionesta
174    1837  Sheffield
175    2842       Kane
176    2434      Olean
177    1640      Olean
178    1821   Tionesta
179    2487      Olean


### Selecting rows
There are two ways to specify rows in the data:
- `loc` selects rows based on the row **label**
- `iloc` selects rows based on the row **index**, meaning its integer position counting from 0
- The two are often the same, but we will work with data later where that is not the case.

See the numbers to the left of the data in the results of the previous cell? In this case they represent both the index and label of each row. pandas automatically assigns an integer row label when the `read_csv` function is used.

Like most programming languages, Python uses **_zero-based indexing_**, which simply means that rows start at 0 instead of 1. So the first row is row 0, the 37th row is row 36, the 487th row is row 486, etc.

Select the first row of the data frame by label

```
df.loc[0]
```

In [16]:
# select first row by label
df.loc[0]

Age                   21
Profit              1387
Location        Tionesta
Vehicle-Type       Sedan
Previous               0
Name: 0, dtype: object

In [17]:
# select the fourth row by label
df.loc[3]

Age                    25
Profit               1040
Location        Sheffield
Vehicle-Type      Compact
Previous                0
Name: 3, dtype: object

To select **multiple rows by label** you can provide them in a list the way we did for column names.


In [18]:
# Select the first, 90th, and 178th rows by label
# Remember: zero-based index and you will have double square brackets
df.loc[[0,89,177]]

|  | Age | Profit | Location | Vehicle-Type | Previous |
|---|---|---|---|---|---|
| 0 | 21 | 1387 | Tionesta | Sedan | 0 |
| 89 | 46 | 978 | Kane | Sedan | 1 |
| 177 | 72 | 1640 | Olean | Sedan | 1 |


Selecting rows by index (integer position) works the same way but with `iloc`.

In [19]:
# Select the first, 90th, and 178th rows by index
df.iloc[[0,89,177]]

|  | Age | Profit | Location | Vehicle-Type | Previous |
|---|---|---|---|---|---|
| 0 | 21 | 1387 | Tionesta | Sedan | 0 |
| 89 | 46 | 978 | Kane | Sedan | 1 |
| 177 | 72 | 1640 | Olean | Sedan | 1 |


---

<div style="background:#fcfce8">
    <p style="color:navy;font-size:200%;font-family: Segoe Script, cursive">Questions to ponder ...</p>
    <p style="color:navy;text-align:center;font-size:150%">Why are the results the same?</p><p style="color:navy;text-align:center;font-size:150%">Why do we need two different ways to do the same thing?</p>
    <p> </p>
</div>

---
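
One way to explore those questions is with a data frame whose row labels do not match the row positions. The sketch below is only an illustration: `demo_df` and its labels 100, 101, and 102 are hypothetical and do not come from the course data.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"Age": [21, 23, 24], "Profit": [1387, 1754, 1817]},
    index=[100, 101, 102],  # hypothetical row labels that do not start at 0
)

by_label = demo_df.loc[101]    # the row whose label is 101
by_position = demo_df.iloc[1]  # the second row, whatever its label is

print(by_label.equals(by_position))

# There is no row labeled 1 here, so loc cannot find it.
try:
    demo_df.loc[1]
except KeyError:
    print("No row has the label 1")
```

With the default labels that `read_csv` assigns, label and position happen to coincide, so `loc` and `iloc` pick the same rows. Once the labels differ, only one of the two will express what you mean.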

### Subsetting specific rows and columns at the same time
It seems like we could just combine what we've done so far to get a subset with only some rows and some columns. Maybe we could:
- Use the `loc` or `iloc` indexer as appropriate.
- Rows first, then columns using lists for multiple values.
    - `df.loc[ROWS, COLUMNS]`

Questions:
1. How do we decide between `loc` and `iloc`?
2. How would we get the value from one specific cell using this technique?

Run the following cell to see how that works.

In [20]:
# Specify the first, 15th, 42nd, and 114th rows with the Age & Profit columns.
df.loc[[0,14,41,113],['Age','Profit']]

|  | Age | Profit |
|---|---|---|
| 0 | 21 | 1387 |
| 14 | 31 | 870 |
| 41 | 39 | 996 |
| 113 | 49 | 1915 |


## Slicing syntax
If we want to subset a range of rows (or columns), we can use Python's slicing syntax with `iloc` and the index values. There are a couple of things to remember here:
- We use a zero-based integer index value.
- Python is **left INclusive, right EXclusive** when using ranges. 
- We specify the beginning and the end of the range (separated by a colon `:`) keeping the first two points in mind.

We always specify rows first, then columns.

- Specify row or column names -> use `loc[]`
- Specify row or column position number -> use `iloc[]`

The generic command using the slicing syntax and position numbers is
```
source_df.iloc[row_start:row_end, col_start:col_end]
```

The key point to remember here is that _the end value is the first value to **not** appear in the slice_. That is what "right exclusive" means.

Another feature is that the start or stop values may be negative numbers, which means Python counts from the end of the rows or columns instead of from the beginning.
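
The slicing rules can be tried out on a plain Python list first and then on a data frame. This sketch uses a hypothetical five-row `demo_df` built in code. It also shows one pandas detail worth knowing: a slice with `loc` works on labels and includes both endpoints, unlike a slice with `iloc`.

```python
import pandas as pd

letters = ["a", "b", "c", "d", "e"]
print(letters[1:3])   # positions 1 and 2; position 3 is excluded
print(letters[-2:])   # the last two elements

# Hypothetical stand-in for the course data.
demo_df = pd.DataFrame({
    "Age": [21, 23, 24, 25, 26],
    "Profit": [1387, 1754, 1817, 1040, 1273],
})

by_position = demo_df.iloc[1:3]  # rows at positions 1 and 2 (right exclusive)
by_label = demo_df.loc[1:3]      # rows labeled 1, 2, and 3 (both ends included)
last_two = demo_df.iloc[-2:]     # a negative start counts from the end

print(len(by_position), len(by_label))
print(last_two.index.tolist())
```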

We can use `iloc` to select a range of rows or columns using slicing syntax and the index number. We'll do more with this later, but for now let's select the 4th through 8th rows and keep all of the columns.

In [21]:
# Range of rows using iloc
# First, use the head method to inspect the first 10 rows for reference, then display the required rows.
# HINT: You'll need to use print() here at least once. Add a new line character for readability.
#    What else could you do to make the results of head() easier to use as a reference?
#       PUT THE COMMANDS IN SEPARATE CELLS

print(df.head(10),'\n')
df.iloc[3:8,:]

   Age  Profit   Location Vehicle-Type  Previous
0   21    1387   Tionesta        Sedan         0
1   23    1754  Sheffield          SUV         1
2   24    1817  Sheffield       Hybrid         1
3   25    1040  Sheffield      Compact         0
4   26    1273       Kane        Sedan         1
5   27    1529  Sheffield        Sedan         1
6   27    3082       Kane        Truck         0
7   28    1951       Kane          SUV         1
8   28    2692   Tionesta      Compact         0
9   29    1206  Sheffield        Sedan         0 



|  | Age | Profit | Location | Vehicle-Type | Previous |
|---|---|---|---|---|---|
| 3 | 25 | 1040 | Sheffield | Compact | 0 |
| 4 | 26 | 1273 | Kane | Sedan | 1 |
| 5 | 27 | 1529 | Sheffield | Sedan | 1 |
| 6 | 27 | 3082 | Kane | Truck | 0 |
| 7 | 28 | 1951 | Kane | SUV | 1 |


In [22]:
# Select all rows for the last three columns using the same method.
df.iloc[:,-3:]

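|  | Location | Vehicle-Type | Previous |
|---|---|---|---|
| 0 | Tionesta | Sedan | 0 |
| 1 | Sheffield | SUV | 1 |
| 2 | Sheffield | Hybrid | 1 |
| 3 | Sheffield | Compact | 0 |
| 4 | Kane | Sedan | 1 |
| 5 | Sheffield | Sedan | 1 |
| 6 | Kane | Truck | 0 |
| 7 | Kane | SUV | 1 |
| 8 | Tionesta | Compact | 0 |
| 9 | Sheffield | Sedan | 0 |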


In [23]:
# PRACTICE
# Slice the rows with index values from 74 to 88 and the columns Location & Vehicle-Type.

df.iloc[74:89,2:4]

|  | Location | Vehicle-Type |
|---|---|---|
| 74 | Tionesta | SUV |
| 75 | Kane | Sedan |
| 76 | Kane | SUV |
| 77 | Sheffield | Compact |
| 78 | Kane | SUV |
| 79 | Olean | SUV |
| 80 | Kane | Compact |
| 81 | Olean | Sedan |
| 82 | Olean | Compact |
| 83 | Olean | Compact |


Keep in mind that it is usually better to use `loc[]` and the specific column names so your code is self-documenting. Try that next.

In [24]:
# PRACTICE
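# Create a subset of the data using the columns Vehicle-Type and Profit, selecting rows 38 through 43.
# Unlike iloc, a loc slice works on labels and includes BOTH endpoints, so the range ends at 43.
df.loc[38:43,['Vehicle-Type','Profit']]

|  | Vehicle-Type | Profit |
|---|---|---|
| 38 | Hybrid | 2119 |
| 39 | SUV | 1766 |
| 40 | Hybrid | 2201 |
| 41 | Compact | 996 |
| 42 | SUV | 2813 |
| 43 | Sedan | 323 |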


### More practice exercises:
Insert cells below here as needed to complete these exercises.
1. What is the value of Profit with row index 6?
2. Display the subset using all columns except the first one (Age) and row index 2 through 7.
3. Display the subset using row indices 4 and 5 using the first three columns.


In [25]:
# row 6 profit
df.loc[6,'Profit']

3082

In [26]:
# except first column, rows 2-7
df.iloc[2:8,-4:]

|  | Profit | Location | Vehicle-Type | Previous |
|---|---|---|---|---|
| 2 | 1817 | Sheffield | Hybrid | 1 |
| 3 | 1040 | Sheffield | Compact | 0 |
| 4 | 1273 | Kane | Sedan | 1 |
| 5 | 1529 | Sheffield | Sedan | 1 |
| 6 | 3082 | Kane | Truck | 0 |
| 7 | 1951 | Kane | SUV | 1 |


In [27]:
# rows 4 & 5, first 3 columns
#df.iloc[4:6,:3]
# or
df.loc[[4,5],['Age','Profit','Location']]


|  | Age | Profit | Location |
|---|---|---|---|
| 4 | 26 | 1273 | Kane |
| 5 | 27 | 1529 | Sheffield |


---
### Expected results
1. 3082
2. <div><img src="attachment:pandas_basics_1.PNG" width="300" align="left"></div>

3. <div><img src="attachment:pandas_basics_2.PNG" width="175" align="left"></div>

## Grouped and Aggregate Calculations, Basic Plotting
The text ends this chapter with sections on these two topics. We will look at these topics in depth later in the semester, so you don't need to worry about them now.