### Pandas Lab -- Basic Selecting & Querying

This lab walks you through the core Pandas syntax for selecting and querying data.

The lab is broken down into three parts, and will be completed throughout class.

 - 1. Basic selectors with Pandas
 - 2. Selecting based on conditions & boolean indexes
 - 3. Special commands for selecting certain types of rows

In [1]:
#loading data and importing packages 
import numpy as np 
import pandas as pd

# local path - adjust to wherever master.csv lives on your machine
url = r"/Users/ethanalter/Dropbox (Personal)/GA-4K-DataScience/gazelle-4K/data/master.csv"
df = pd.read_csv(url)

### Section I: Selecting Data With Pandas

**1). What is the average number of visitors throughout the entire dataset?**

In [3]:
df['visitors'].mean()

20.973761245180636

**2). What are the median values of the visitors and holiday columns?**

In [4]:
df[['visitors', 'holiday']].median()

visitors    17.0
holiday      0.0
dtype: float64

**3). What was the lowest number of visitors among the first 5000 rows in the dataset?**

In [5]:
df.head(5000)['visitors'].min()

1

In [20]:
#another way to do this - only works if the index is the default 0, 1, 2, ... - remember that loc looks at labels and includes the end label!
df.loc[:4999, 'visitors'].min()

1

In [None]:
#can change the index with df.set_index('id')
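The difference between label-based and position-based slicing is easier to see on a tiny frame. This is a minimal sketch on made-up data (`toy` is a hypothetical stand-in, not the lab's `master.csv`):

```python
import pandas as pd

# Hypothetical stand-in data; the real lab uses master.csv.
toy = pd.DataFrame({"id": ["a", "b", "c", "d"], "visitors": [5, 1, 9, 3]})

# .loc slices by label and includes the end label: labels 0, 1 and 2.
by_label = toy.loc[:2, "visitors"]

# .iloc slices by position and excludes the end position: positions 0 and 1.
by_position = toy.iloc[:2]["visitors"]

# After set_index('id') the labels are strings, so label slices use
# those strings, while positional slicing is unaffected.
indexed = toy.set_index("id")
first_two = indexed.iloc[:2]["visitors"]
up_to_b = indexed.loc[:"b", "visitors"]

print(len(by_label), len(by_position), len(first_two), len(up_to_b))
```

On the default index, `.loc[:n]` therefore returns one more row than `.iloc[:n]` or `.head(n)`.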

**4). What is the modal value of the last 4 columns in the dataset?**

In [6]:
#[rows,columns]
df.iloc[:,-4:].mode()

Unnamed: 0,area,latitude,longitude,reserve_visitors
0,Fukuoka-ken Fukuoka-shi Daimyō,33.589216,130.392813,2.0


**5). What is the mean value of the first 250 rows of the first 3 columns in the dataset?**

In [7]:
df.iloc[:250, :3].mean()

visitors    24.912
dtype: float64

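Note that `mean()` is only computed for the numeric columns: `id` and `visit_date` are text, so only `visitors` appears in the result. Recent pandas versions may raise an error on text columns instead of skipping them, in which case pass `numeric_only=True`.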
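To make the numeric-only behaviour explicit rather than relying on the default, `numeric_only=True` can be passed to `mean()`. A small sketch on made-up data (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in with two text columns and one numeric column.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "visit_date": ["2016-01-13", "2016-01-14", "2016-01-15"],
    "visitors": [25, 32, 29],
})

# Restrict the mean to numeric columns explicitly.
means = toy.iloc[:, :3].mean(numeric_only=True)
print(means)
```

Only `visitors` shows up in `means`, which matches the output above.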

### Section II: Selecting Based on Conditions

**1). What was the average attendance on Monday?  On the weekend (Saturday & Sunday)?**

In [17]:
df.columns


Index(['id', 'visit_date', 'visitors', 'day_of_week', 'holiday', 'genre',
       'area', 'latitude', 'longitude', 'reserve_visitors'],
      dtype='object')

In [16]:
#Monday
df[df['day_of_week'] == 'Monday']['visitors'].mean()

17.177009027207877

In [49]:
df[df['day_of_week'].isin(['Saturday', 'Sunday'])]['visitors'].mean()

25.256869738495084

In [9]:
# slightly more straight forward way
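The two averages above can also be computed with `groupby` and with `.loc`, which avoids chaining two sets of square brackets. This is a sketch on made-up data, not the lab's numbers:

```python
import pandas as pd

# Hypothetical stand-in data.
toy = pd.DataFrame({
    "day_of_week": ["Monday", "Saturday", "Sunday", "Monday", "Tuesday"],
    "visitors": [10, 30, 20, 20, 5],
})

# Average visitors for every day of the week at once.
by_day = toy.groupby("day_of_week")["visitors"].mean()

# Weekend average: build a boolean mask, then select rows and column in one .loc call.
is_weekend = toy["day_of_week"].isin(["Saturday", "Sunday"])
weekend_mean = toy.loc[is_weekend, "visitors"].mean()

print(by_day["Monday"], weekend_mean)
```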

**2). Is attendance higher on average for holidays or non-holidays?**

In [41]:
if (df[df['holiday'] == 1]['visitors'].mean()) > (df[df['holiday'] == 0]['visitors'].mean()):
    print("Attendance higher on holidays")
else: 
    print("Attendance higher on non-holidays")

Attendance higher on holidays


I'm not sure why this is not working...

**3). What was the highest day of attendance for Dining Bars?**

In [45]:
df[df['genre'] == "Dining bar"].sort_values(by='visitors', ascending=False).head(1)['visit_date']

245791    2017-01-23
Name: visit_date, dtype: object

In [51]:
#another way to do it 
idx = df[df['genre'] == 'Dining bar']['visitors'].idxmax() #find the index label of the max row
df.loc[idx] #idxmax returns a label, so use df.loc[idx] to return the actual row

id                           air_c6aa2efba0ffc8eb
visit_date                             2017-01-23
visitors                                      348
day_of_week                                Monday
holiday                                         0
genre                                  Dining bar
area                Tōkyō-to Adachi-ku Chūōhonchō
latitude                                  35.7757
longitude                                 139.804
reserve_visitors                               25
Name: 245791, dtype: object
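`idxmax()` returns an index label, not a row position. The two coincide on the default 0, 1, 2, ... index used in this lab, but not in general. A minimal sketch with a deliberately non-default index (made-up data):

```python
import pandas as pd

# Hypothetical stand-in data with index labels 10, 20, 30.
toy = pd.DataFrame(
    {"genre": ["Cafe", "Dining bar", "Dining bar"], "visitors": [50, 7, 12]},
    index=[10, 20, 30],
)

# idxmax returns the index *label* of the largest value in the filtered column.
idx = toy[toy["genre"] == "Dining bar"]["visitors"].idxmax()

# .loc looks the label up directly.
row = toy.loc[idx]
print(idx, row["visitors"])
# toy.iloc[idx] would raise an IndexError here, because there is no position 30.
```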

**4). Among holidays, which date had the highest number of reservations?  Hint:  use the `idxmax()` function**

In [55]:
idx = df[df['holiday'] == 1]['reserve_visitors'].idxmax()
df.loc[idx]

id                          air_64d4491ad8cdb1c6
visit_date                            2016-12-30
visitors                                      23
day_of_week                               Friday
holiday                                        1
genre                                 Dining bar
area                Tōkyō-to Minato-ku Shibakōen
latitude                                 35.6581
longitude                                139.752
reserve_visitors                              58
Name: 1503, dtype: object

In [15]:
# get the index position

### Section III: Special Types of Selectors

To get some additional practice with common Pandas methods, we'll go over some typical scenarios in which you need to select data. 

*The methods used in this section have not been covered in class.*  Each question comes with a recommended method.  A good way in is to put `?` before the method name (for example `?df.isnull`) to read how it works and figure out how to use it.  

It's designed to be a little bit of a treasure hunt to familiarize yourself with a lot of the bread & butter pandas methods.

**1). Can you return the number of null values for each column?**

To use: `df.isnull()`.  **Hint:** `True` sums to 1, `False` to 0.

In [3]:
df.isnull().sum(axis=0)

id                       0
visit_date               0
visitors                 0
day_of_week              0
holiday                  0
genre                    0
area                     0
latitude                 0
longitude                0
reserve_visitors    143714
dtype: int64
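Because `True` counts as 1 and `False` as 0, the same boolean frame gives the share of missing values via `mean()`. A short sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with missing reservations.
toy = pd.DataFrame({
    "visitors": [25, 32, 29, 22],
    "reserve_visitors": [np.nan, np.nan, 7.0, np.nan],
})

null_counts = toy.isnull().sum()   # number of missing values per column
null_share = toy.isnull().mean()   # fraction of missing values per column
print(null_counts["reserve_visitors"], null_share["reserve_visitors"])
```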

**2). Can you find the count values for every single unique value within a column?**

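To use: `pd.Series.value_counts()`.  **Hint:** Call it on a single column (a *Series*), e.g. `df['genre'].value_counts()`.  Newer pandas versions also have a `DataFrame.value_counts()`, but that counts unique rows rather than the values of one column.  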

In [5]:
df['genre'].value_counts()

Izakaya                         62052
Cafe/Sweets                     52764
Dining bar                      34192
Italian/French                  30011
Bar/Cocktail                    25135
Japanese food                   18789
Other                            8246
Yakiniku/Korean food             7025
Western food                     4897
Creative cuisine                 3868
Okonomiyaki/Monja/Teppanyaki     3706
Asian                             535
Karaoke/Party                     516
International cuisine             372
Name: genre, dtype: int64
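If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical stand-in for the genre column.
genres = pd.Series(["Izakaya", "Izakaya", "Cafe/Sweets", "Izakaya"])

counts = genres.value_counts()                 # raw counts, largest first
shares = genres.value_counts(normalize=True)   # fractions that sum to 1
print(counts["Izakaya"], shares["Izakaya"])
```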

**3). Can you find the column with the highest number of unique values?  Can you sort columns by their number of unique values?**

To use: `df.nunique()`, and `sort_values()` on the resulting Series if you want to sort it.

In [26]:
df.nunique().sort_values(ascending=False)

id                  829
visit_date          478
visitors            204
longitude           108
latitude            108
area                103
reserve_visitors     49
genre                14
day_of_week           7
holiday               2
dtype: int64

In [None]:
#more sorting
df.sort_values(by = ['visit_date', 'visitors'], ascending = [True, False])

**4). Can you query your dataframe so that it only returns columns that have empty values?**

To use: `df.isnull()`, `df.loc`

In [19]:
df.loc[:,df.isnull().sum(axis=0)>0]

Unnamed: 0,reserve_visitors
0,
1,
2,
3,
4,
...,...
252103,6.0
252104,37.0
252105,35.0
252106,3.0


In [28]:
#more beautiful way + storing it 
empty_cols = df.loc[:,df.isnull().sum(axis=0)>0].columns.tolist()

In [29]:
empty_cols

['reserve_visitors']

**5).  Can you query the dataframe such that it only returns rows that have *no* missing values, in any of their columns?**

To use: `df.isnull()`, `df.any()`, or, conversely, `df.notnull()`, and `df.all()`

**Hint:** The `~` operator, if put in front of a query, selects for values that are **not** True.

In [24]:
df.loc[df.isnull().sum(axis=1)==0,:]

Unnamed: 0,id,visit_date,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
11,air_ba937bf13d40fb24,2016-01-26,11,Tuesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
21,air_ba937bf13d40fb24,2016-02-09,15,Tuesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,7.0
24,air_ba937bf13d40fb24,2016-02-12,26,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,18.0
25,air_ba937bf13d40fb24,2016-02-13,8,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
37,air_ba937bf13d40fb24,2016-02-27,23,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0
252104,air_a17f0778617c76e2,2017-04-22,60,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0
252105,air_a17f0778617c76e2,2017-03-26,69,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0
252106,air_a17f0778617c76e2,2017-03-20,31,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0


In [31]:
#another way: build a boolean mask of complete rows - pass it to df[...] to keep only those rows
df.notnull().all(axis=1)

0         False
1         False
2         False
3         False
4         False
          ...  
252103     True
252104     True
252105     True
252106     True
252107     True
Length: 252108, dtype: bool

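`all()` and `any()` are handy boolean checks: `all(axis=1)` is True for a row only when every column passes, while `any(axis=1)` is True when at least one does.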
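The mask on its own only says which rows are complete. Indexing with it returns the rows themselves. The sketch below, on made-up data, shows the `notnull().all()` route, the `~isnull().any()` route from the hint, and `dropna()`, a method not listed in the question that should give the same result:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: only the middle row is complete.
toy = pd.DataFrame({
    "visitors": [25, 32, 29],
    "reserve_visitors": [np.nan, 7.0, np.nan],
})

complete = toy.notnull().all(axis=1)          # True where no column is missing
kept = toy[complete]                          # rows with no missing values
also_kept = toy[~toy.isnull().any(axis=1)]    # same rows via the ~ operator
dropped = toy.dropna()                        # same rows via dropna

print(kept)
```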

**6).  Can you find rows that contain duplicate values?**

To use:  `df.duplicated()`

In [32]:
df[df.duplicated()]
#looks like no duplicate rows 

Unnamed: 0,id,visit_date,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors


**7). Can you find rows that contain duplicated values for the visitors and date columns?**  

To use: `df.duplicated()`

In [35]:
df[df.duplicated(subset = ['visit_date', 'visitors'])]

Unnamed: 0,id,visit_date,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
416,air_25e9888d30b386df,2016-02-18,22,Thursday,0,Izakaya,Tōkyō-to Shinagawa-ku Higashigotanda,35.626568,139.725858,
424,air_25e9888d30b386df,2016-03-02,21,Wednesday,0,Izakaya,Tōkyō-to Shinagawa-ku Higashigotanda,35.626568,139.725858,
442,air_25e9888d30b386df,2016-03-27,1,Sunday,0,Izakaya,Tōkyō-to Shinagawa-ku Higashigotanda,35.626568,139.725858,
654,air_25e9888d30b386df,2017-04-16,1,Sunday,0,Izakaya,Tōkyō-to Shinagawa-ku Higashigotanda,35.626568,139.725858,7.0
726,air_fd6aac1043520e83,2016-01-30,12,Saturday,0,Izakaya,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
...,...,...,...,...,...,...,...,...,...,...
252102,air_a17f0778617c76e2,2017-04-20,22,Thursday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,1.0
252103,air_a17f0778617c76e2,2017-04-21,49,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0
252105,air_a17f0778617c76e2,2017-03-26,69,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0
252106,air_a17f0778617c76e2,2017-03-20,31,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0
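By default `duplicated()` flags only the second and later occurrences, so the first row of each duplicated group is not returned. Passing `keep=False` flags every member of the group. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical stand-in data: rows a, b and d share the same date and visitor count.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "visit_date": ["2016-01-13", "2016-01-13", "2016-01-14", "2016-01-13"],
    "visitors": [25, 25, 25, 25],
})

# Default keep='first': the first occurrence is not marked.
later_copies = toy.duplicated(subset=["visit_date", "visitors"])

# keep=False: every row that has a twin is marked.
all_copies = toy.duplicated(subset=["visit_date", "visitors"], keep=False)

print(later_copies.tolist())
print(all_copies.tolist())
```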


**8).  Can you only select columns that are text based?**

To use: `df.select_dtypes()`, and (optionally) the `columns` attribute.  **Note:** `columns` is NOT a method!

In [37]:
df.select_dtypes(include = 'object')
#include is a very important argument
# be careful - numpy comes with its own datatypes, and the old np.object alias has been removed from recent NumPy releases

Unnamed: 0,id,visit_date,day_of_week,genre,area
0,air_ba937bf13d40fb24,2016-01-13,Wednesday,Dining bar,Tōkyō-to Minato-ku Shibakōen
1,air_ba937bf13d40fb24,2016-01-14,Thursday,Dining bar,Tōkyō-to Minato-ku Shibakōen
2,air_ba937bf13d40fb24,2016-01-15,Friday,Dining bar,Tōkyō-to Minato-ku Shibakōen
3,air_ba937bf13d40fb24,2016-01-16,Saturday,Dining bar,Tōkyō-to Minato-ku Shibakōen
4,air_ba937bf13d40fb24,2016-01-18,Monday,Dining bar,Tōkyō-to Minato-ku Shibakōen
...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,Friday,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri
252104,air_a17f0778617c76e2,2017-04-22,Saturday,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri
252105,air_a17f0778617c76e2,2017-03-26,Sunday,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri
252106,air_a17f0778617c76e2,2017-03-20,Monday,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri


In [38]:
# use df.info() to check the dtypes 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                252108 non-null  object 
 1   visit_date        252108 non-null  object 
 2   visitors          252108 non-null  int64  
 3   day_of_week       252108 non-null  object 
 4   holiday           252108 non-null  int64  
 5   genre             252108 non-null  object 
 6   area              252108 non-null  object 
 7   latitude          252108 non-null  float64
 8   longitude         252108 non-null  float64
 9   reserve_visitors  108394 non-null  float64
dtypes: float64(3), int64(2), object(5)
memory usage: 19.2+ MB


**9).  Can you only select columns that are numeric?**

To use: `df.select_dtypes()`.  This question is very similar to the one above it, just for a different data type.

In [41]:
numeric_cols = df.select_dtypes(exclude = 'object')
#include = 'number' is a more direct alternative

**10). Can you fill in the missing values of your numeric columns with their average value?**

To use: `df.fillna()`, to be used in conjunction with the suggested method from question 9.

In [26]:
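num_cols = df.select_dtypes(include = 'number').columns #names of the numeric columns
df[num_cols].fillna(df[num_cols].mean())
#this returns a copy with the NAs filled by each column's mean; df itself is unchanged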
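`fillna` returns a new object rather than modifying the frame, so the result has to be assigned back if it should persist. A minimal sketch on made-up data (the names `toy` and `filled` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with one missing reservation count.
toy = pd.DataFrame({
    "genre": ["Cafe", "Izakaya", "Asian"],
    "visitors": [10, 20, 30],
    "reserve_visitors": [2.0, np.nan, 6.0],
})

# Names of the numeric columns.
num_cols = toy.select_dtypes(include="number").columns

# Work on a copy and assign the filled columns back.
filled = toy.copy()
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].mean())

print(filled["reserve_visitors"].tolist())
```

The missing value is replaced by the mean of the two observed values, and `toy` still contains its original NaN.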

**11). Can you select all the rows between Jan. 1 2016 & June 30, 2016?**

In [43]:
df[df['visit_date'].between('2016-01-01', '2016-06-30')]
#nice! so clean - this works on the raw strings because ISO dates (YYYY-MM-DD) sort in calendar order

Unnamed: 0,id,visit_date,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
...,...,...,...,...,...,...,...,...,...,...
126441,air_764f71040a413d4d,2016-06-12,71,Sunday,0,Asian,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,
126442,air_764f71040a413d4d,2016-06-19,75,Sunday,0,Asian,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,
126443,air_764f71040a413d4d,2016-06-26,56,Sunday,0,Asian,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,
126479,air_764f71040a413d4d,2016-05-29,73,Sunday,0,Asian,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,
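The string comparison relies on every date being written as YYYY-MM-DD. Converting to datetimes first removes that dependency, and `between` includes both endpoints by default. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical stand-in data: two dates inside the range, two outside.
toy = pd.DataFrame({
    "visit_date": ["2016-01-01", "2016-06-30", "2016-07-01", "2015-12-31"],
    "visitors": [5, 8, 13, 21],
})

# Parse the strings into datetimes, then filter on the parsed values.
dates = pd.to_datetime(toy["visit_date"])
in_range = toy[dates.between("2016-01-01", "2016-06-30")]

print(in_range)
```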


**12).  Can you determine the quarter of the year for each reservation?  The month?**

In [46]:
df['visit_date'].dt.quarter

AttributeError: Can only use .dt accessor with datetimelike values

In [47]:
df['visit_date'] = pd.to_datetime(df['visit_date'])

In [49]:
df['visit_date'].dt.quarter

0         1
1         1
2         1
3         1
4         1
         ..
252103    2
252104    2
252105    1
252106    1
252107    2
Name: visit_date, Length: 252108, dtype: int64
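The month part of the question follows the same pattern through the `.dt` accessor once the column holds datetimes. A minimal sketch on made-up dates:

```python
import pandas as pd

# Hypothetical stand-in data.
toy = pd.DataFrame({"visit_date": ["2016-01-13", "2016-04-21", "2017-03-20"]})
toy["visit_date"] = pd.to_datetime(toy["visit_date"])

quarters = toy["visit_date"].dt.quarter   # 1 through 4
months = toy["visit_date"].dt.month       # 1 through 12

print(quarters.tolist(), months.tolist())
```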