### Pandas Lab -- Basic Selecting & Querying

This lab walks you through the core Pandas syntax for selecting and querying data.

The lab is broken down into three parts, and will be completed throughout class.

 - 1. Basic selectors with Pandas
 - 2. Selecting based on conditions & boolean indexes
 - 3. Special commands for selecting certain types of rows

In [11]:
import pandas as pd
import numpy as np

In [10]:
restaurants = pd.read_csv(r"C:\Users\chloe\Data Science\DAT-1019-Chloe\ClassMaterial\Unit2\data\restaurants.csv")

In [3]:
restaurants.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,


### Section I: Selecting Data With Pandas

**1). What is the average number of visitors throughout the entire dataset?**

In [4]:
restaurants['visitors'].mean()

20.973761245180636

**2). What are the median values of the visitors and holiday columns?**

In [12]:
# your answer here
restaurants[['visitors','holiday']].median()

visitors    17.0
holiday      0.0
dtype: float64

**3). What was the lowest number of visitors among the first 5000 rows in the dataset?**

In [14]:
restaurants['visitors'][:5000].min()

1

**4). What is the modal value of the last 4 columns in the dataset?**

In [10]:
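# your answer here
# note: `:4` selects the FIRST four columns (shown below); the last four would be restaurants.iloc[:, -4:]
restaurants.iloc[:,:4].mode()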

Unnamed: 0,id,visit_date,visitors,calendar_date
0,air_5c817ef28f236bdf,2017-03-17,8,2017-03-17
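
As a hedged aside on positional slicing: with `iloc`, `:4` takes the first four columns, while `-4:` takes the last four. The sketch below uses a hypothetical five-column frame (`toy` and its column names are made up for illustration), not the restaurants data.

```python
import pandas as pd

# Hypothetical frame with five columns, for illustration only
toy = pd.DataFrame({
    "a": [1, 1, 2],
    "b": [3, 4, 4],
    "c": [5, 5, 5],
    "d": [6, 7, 7],
    "e": [8, 8, 9],
})

first_four = toy.iloc[:, :4]   # columns a, b, c, d
last_four = toy.iloc[:, -4:]   # columns b, c, d, e

# mode() returns a DataFrame; row 0 holds the most frequent value per column
last_four_mode = last_four.mode()
print(last_four_mode)
```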


**5). What is the mean value of the first 250 rows of the first 3 columns in the dataset?**

In [9]:
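# your answer here
# numeric_only=True skips the text columns (id, visit_date); recent pandas versions raise a TypeError without it
restaurants.iloc[:250, :3].mean(numeric_only=True)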

visitors    24.912
dtype: float64

### Section II: Selecting Based on Conditions

**1). What was the average attendance on Monday?  On the weekend (Saturday & Sunday)?**

In [22]:
df = restaurants
print(f"Monday Mean: {df[df['day_of_week'] == 'Monday']['visitors'].mean()}")
print(f"Weekend Mean: {df[(df['day_of_week'] == 'Saturday') | (df['day_of_week'] == 'Sunday')]['visitors'].mean()}")

Monday Mean: 17.177009027207877
Weekend Mean: 25.256869738495084
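
The weekend filter can also be written with `Series.isin()`, which avoids chaining several `==` comparisons with `|`. This is a minimal sketch on a hypothetical frame (`toy` and its values are assumptions for illustration), not the lab's data.

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "day_of_week": ["Monday", "Saturday", "Sunday", "Monday"],
    "visitors": [10, 30, 50, 20],
})

# Boolean mask: True where the day is Saturday or Sunday
is_weekend = toy["day_of_week"].isin(["Saturday", "Sunday"])

# .loc[mask, column] selects the rows and the column in one step
weekend_mean = toy.loc[is_weekend, "visitors"].mean()
monday_mean = toy.loc[toy["day_of_week"] == "Monday", "visitors"].mean()

print(f"Monday Mean: {monday_mean}")
print(f"Weekend Mean: {weekend_mean}")
```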


**2). Is attendance higher on average for holidays or non-holidays?**

In [28]:
# your answer here
print(f"Mean Holiday Attendance: {df[df['holiday'] != 0]['visitors'].mean()}")
print(f"Mean Non-holiday Attendance: {df[df['holiday'] == 0]['visitors'].mean()}")

Mean Holiday Attendance: 23.703326810176126
Mean Non-holiday Attendance: 20.828063827386945


**3). What was the highest day of attendance for Dining Bars?**

In [35]:
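# your answer here -- notice the different way of selecting
# filter to dining bars first, so rows from other genres with the same count cannot match
dining = df[df['genre'] == 'Dining bar']
dining[dining['visitors'] == dining['visitors'].max()]['calendar_date']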

245791    2017-01-23
Name: calendar_date, dtype: object

**4). Which holiday date had the highest number of reservations?  Hint: use the `idxmax()` function**

In [49]:
# your answer here
df[df['reserve_visitors'] == df[df['holiday'] != 0]['reserve_visitors'].max()]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
1503,air_64d4491ad8cdb1c6,2016-12-30,23,2016-12-30,Friday,1,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,58.0
1880,air_ee3a01f0c71a769f,2016-12-30,53,2016-12-30,Friday,1,Cafe/Sweets,Shizuoka-ken Hamamatsu-shi Motoshirochō,34.710895,137.725940,58.0
2337,air_9438d67241c81314,2016-12-30,28,2016-12-30,Friday,1,Italian/French,Fukuoka-ken Fukuoka-shi Daimyō,33.589216,130.392813,58.0
2805,air_d0e8a085d8dc83aa,2016-12-30,12,2016-12-30,Friday,1,Cafe/Sweets,Hyōgo-ken Kōbe-shi Sumiyoshi Higashimachi,34.720228,135.265455,58.0
3246,air_5c65468938c07fa5,2016-12-30,7,2016-12-30,Friday,1,Other,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,58.0
...,...,...,...,...,...,...,...,...,...,...,...
250875,air_0164b9927d20bcc3,2016-12-30,21,2016-12-30,Friday,1,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,58.0
250992,air_965b2e0cf4119003,2016-12-30,26,2016-12-30,Friday,1,Izakaya,Tōkyō-to Meguro-ku Kamimeguro,35.641463,139.698171,58.0
251149,air_a257c9749d8d0ff6,2016-12-30,42,2016-12-30,Friday,1,Izakaya,Hokkaidō Asahikawa-shi 6 Jōdōri,43.770635,142.364819,58.0
251295,air_e00fe7853c0100d6,2016-12-30,28,2016-12-30,Friday,1,Izakaya,Hyōgo-ken Kakogawa-shi Kakogawachō Kitazaike,34.756950,134.841177,58.0


In [61]:
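# idxmax() returns an index label, so use .loc (note: this ranks by `visitors`, not `reserve_visitors`)
df.loc[df[df.holiday == 1]['visitors'].idxmax()]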

id                                     air_df554c4527a1cfe6
visit_date                                       2016-12-30
visitors                                                205
calendar_date                                    2016-12-30
day_of_week                                          Friday
holiday                                                   1
genre                                               Izakaya
area                Shizuoka-ken Hamamatsu-shi Motoshirochō
latitude                                            34.7109
longitude                                           137.726
reserve_visitors                                         58
Name: 122871, dtype: object
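
`idxmax()` returns the index *label* of the largest value, not its position, so `.loc` is the matching accessor. The two only coincide when the index is the default 0, 1, 2, ... range. A minimal sketch, assuming a hypothetical frame with a non-default index (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical data with a non-default index, for illustration only
toy = pd.DataFrame(
    {
        "visit_date": ["2016-01-01", "2016-01-14", "2016-12-30"],
        "holiday": [1, 0, 1],
        "reserve_visitors": [5.0, 99.0, 8.0],
    },
    index=[10, 20, 30],
)

# Keep only the holiday rows, then find the label of the largest reservation count
holidays = toy[toy["holiday"] == 1]
label = holidays["reserve_visitors"].idxmax()

# Look the row up by label; toy.iloc[label] would raise an IndexError here
best_row = toy.loc[label]
print(best_row["visit_date"])
```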

### Section III: Special Types of Selectors

To get some additional practice with common Pandas methods, we'll work through some scenarios you will frequently need to select data for.

*The methods used in this section have not been covered in class.*  Each question comes with the recommended method to use.  Put `?` before a method (for example, `?df.isnull`) to read its documentation and figure out how to use it.

This section is designed as a bit of a treasure hunt, to familiarize you with many of the bread & butter pandas methods.

**1). Can you return the number of null values for each column?**

To use: `df.isnull()`.  **Hint:** `True` sums to 1, `False` to 0.
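
The hint can be sketched on a small made-up frame: `isnull()` produces a frame of booleans, and `sum()` adds them up column by column. `toy` and its values are assumptions for illustration, not the restaurants data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries, for illustration only
toy = pd.DataFrame({
    "visitors": [25, 32, np.nan, 22],
    "reserve_visitors": [np.nan, np.nan, 4.0, np.nan],
    "genre": ["Izakaya", "Cafe/Sweets", "Izakaya", None],
})

# isnull() gives True/False per cell; sum() counts each True as 1
null_counts = toy.isnull().sum()
print(null_counts)
```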

In [12]:
# your answer here
import pandas as pd
df = pd.read_csv(r"C:\Users\chloe\Data Science\DAT-1019-Chloe\ClassMaterial\Unit2\data\restaurants.csv")

In [38]:
# rows with at least one null value; df.isnull().sum() gives the per-column null counts
df[df.isnull().any(axis=1)]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
...,...,...,...,...,...,...,...,...,...,...,...
252087,air_a17f0778617c76e2,2017-04-04,10,2017-04-04,Tuesday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,
252092,air_a17f0778617c76e2,2017-04-10,28,2017-04-10,Monday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,
252099,air_a17f0778617c76e2,2017-04-17,19,2017-04-17,Monday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,
252100,air_a17f0778617c76e2,2017-04-18,11,2017-04-18,Tuesday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,


In [39]:
df['genre'].value_counts()

Izakaya                         62052
Cafe/Sweets                     52764
Dining bar                      34192
Italian/French                  30011
Bar/Cocktail                    25135
Japanese food                   18789
Other                            8246
Yakiniku/Korean food             7025
Western food                     4897
Creative cuisine                 3868
Okonomiyaki/Monja/Teppanyaki     3706
Asian                             535
Karaoke/Party                     516
International cuisine             372
Name: genre, dtype: int64

In [40]:
?df.genre.value_counts

**2). Can you find the count values for every single unique value within a column?**

To use: `pd.Series.value_counts()`.  **Hint:** This is a *Series* method, not a *DataFrame* method.

In [None]:
# your answer here

**3). Can you find the column with the highest number of unique values?  Can you sort columns by their number of unique values?**

To use: `df.nunique()`, and `sort_values()` on the result if you want to sort it.
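
A minimal sketch of how these pieces fit together, assuming a hypothetical frame (`toy` is made up for illustration): `nunique()` returns a Series indexed by column name, so the Series methods `idxmax()` and `sort_values()` apply to it.

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "day_of_week": ["Monday", "Monday", "Friday", "Friday"],
    "holiday": [0, 0, 0, 0],
})

unique_counts = toy.nunique()                         # one count per column
top_column = unique_counts.idxmax()                   # column with the most distinct values
ranked = unique_counts.sort_values(ascending=False)   # largest count first

print(top_column)
print(ranked)
```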

In [None]:
# your answer here

**4). Can you query your dataframe so that it only returns columns that have empty values?**

To use: `df.isnull()`, `df.loc`
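
One way to combine the two, sketched on a hypothetical frame (`toy` is an assumption for illustration): `isnull().any()` yields one boolean per column, and `.loc[:, mask]` keeps only the columns where that boolean is `True`.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "visitors": [25, 32],
    "reserve_visitors": [np.nan, 3.0],
    "genre": ["Izakaya", None],
})

# True for every column that holds at least one null value
has_nulls = toy.isnull().any()

# ":" keeps all rows; the boolean Series picks the columns
cols_with_nulls = toy.loc[:, has_nulls]
print(cols_with_nulls)
```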

In [None]:
# your answer here

**5).  Can you query the dataframe such that it only returns rows that have *no* missing values, in any of their columns?**

To use: `df.isnull()`, `df.any()`, or, conversely, `df.notnull()`, and `df.all()`

**Hint:** The `~` operator, if put in front of a query, selects for values that are **not** True.
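
Both routes mentioned above can be sketched side by side on a hypothetical frame (`toy` is made up for illustration). With `axis=1` the check runs across each row instead of down each column.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "visitors": [25, 32, 29],
    "reserve_visitors": [2.0, np.nan, 7.0],
})

# Route 1: flag rows with any null, then invert the mask with ~
complete_rows = toy[~toy.isnull().any(axis=1)]

# Route 2: keep rows where every value is non-null
same_rows = toy[toy.notnull().all(axis=1)]

print(complete_rows)
```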

In [None]:
# your answer here

**6).  Can you find rows that are duplicates of other rows?**

To use:  `df.duplicated()`
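
A small sketch of what `duplicated()` returns, using a hypothetical frame (`toy` is an assumption for illustration). By default the first occurrence of a repeated row is not flagged; `keep=False` flags every copy.

```python
import pandas as pd

# Hypothetical data where the first and last rows are identical
toy = pd.DataFrame({
    "id": ["a", "b", "a"],
    "visitors": [10, 20, 10],
})

later_copies = toy.duplicated()           # flags repeats after the first occurrence
all_copies = toy.duplicated(keep=False)   # flags every row that has a twin

# Use the boolean mask to pull out the duplicated rows themselves
duplicate_rows = toy[all_copies]
print(duplicate_rows)
```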

In [None]:
# your answer here

**7). Can you find rows that contain duplicated values for the visitors and date columns?**  

To use: `df.duplicated()`
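
The `subset` parameter of `duplicated()` restricts the comparison to the listed columns. A minimal sketch on a hypothetical frame (`toy` is made up; the column names mirror `visitors` and `visit_date` from the lab data):

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share visitors and visit_date but differ in id
toy = pd.DataFrame({
    "id": ["x", "y", "z"],
    "visit_date": ["2016-01-13", "2016-01-13", "2016-01-14"],
    "visitors": [10, 10, 10],
})

# Comparing whole rows finds nothing, because every id is different
whole_row = toy.duplicated(keep=False)

# Comparing only the two columns of interest flags rows 0 and 1
on_subset = toy.duplicated(subset=["visitors", "visit_date"], keep=False)

print(toy[on_subset])
```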

In [None]:
# your answer here

**8).  Can you only select columns that are text based?**

To use: `df.select_dtypes()`, and (optionally) the `columns` attribute.  **Note:** `columns` is NOT a method!
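
A hedged sketch of `select_dtypes()` on a hypothetical frame (`toy` is made up for illustration). It assumes the text columns are stored with the `object` dtype, which is what `pd.read_csv` has traditionally produced; the example sets that dtype explicitly so it does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical data; text columns are explicitly stored as object dtype
toy = pd.DataFrame({
    "genre": pd.Series(["Izakaya", "Asian"], dtype=object),
    "area": pd.Series(["Tokyo", "Osaka"], dtype=object),
    "visitors": [25, 32],
    "latitude": [35.6, 34.7],
})

# Keep only the object-dtype (text) columns
text_only = toy.select_dtypes(include="object")

# columns is an attribute, so there are no parentheses after it
text_column_names = list(text_only.columns)
print(text_column_names)
```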

In [None]:
# your answer here

**9).  Can you only select columns that are numeric?**

To use: `df.select_dtypes()`.  This question is very similar to the one above it, just for a different data type.

In [None]:
# your answer here

**10). Can you fill in the missing values of your numeric columns with their average value?**

To use: `df.fillna()`, to be used in conjunction with the suggested methods from question 9.
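
One possible approach, sketched on a hypothetical frame (`toy` is an assumption for illustration): pick the numeric columns with `select_dtypes()`, compute their means, and pass that Series to `fillna()`, which matches each mean to its column by name.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "visitors": [10.0, np.nan, 30.0],
    "reserve_visitors": [np.nan, 4.0, 6.0],
    "genre": ["Izakaya", None, "Asian"],
})

# Names of the numeric columns only
numeric_cols = toy.select_dtypes(include="number").columns

# Work on a copy so the original frame is left untouched
filled = toy.copy()
column_means = filled[numeric_cols].mean()
filled[numeric_cols] = filled[numeric_cols].fillna(column_means)

print(filled)
```

The text column `genre` is left as-is, missing value included, because a mean is not defined for it.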

In [None]:
# your answer here

**11). Can you select all the rows between Jan. 1, 2016 and June 30, 2016?**
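
No method is suggested for this question, so the following is only one possible sketch. It assumes the date column is first converted with `pd.to_datetime()`, and it uses `Series.between()`, which includes both endpoints by default. `toy` is a hypothetical frame for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "visit_date": ["2015-12-31", "2016-01-01", "2016-06-30", "2016-07-01"],
    "visitors": [5, 10, 15, 20],
})

# Convert the text dates into real datetimes so they compare chronologically
toy["visit_date"] = pd.to_datetime(toy["visit_date"])

# Boolean mask for dates from Jan. 1 through June 30, 2016 (endpoints included)
in_range = toy["visit_date"].between(
    pd.Timestamp("2016-01-01"), pd.Timestamp("2016-06-30")
)
first_half = toy[in_range]
print(first_half)
```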

In [None]:
# your answer here

**12).  Can you determine the quarter of the year for each reservation?  The month?**

In [None]:
# we can get the quarters using the dt attribute
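
A minimal sketch of the `dt` accessor, assuming the date column has been converted with `pd.to_datetime()` first (the accessor is only available on datetime-like Series). `toy` is a hypothetical frame for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "visit_date": ["2016-01-13", "2016-06-30", "2016-12-30"],
})

# The dt accessor needs a datetime column, not text
toy["visit_date"] = pd.to_datetime(toy["visit_date"])

toy["quarter"] = toy["visit_date"].dt.quarter   # 1 through 4
toy["month"] = toy["visit_date"].dt.month       # 1 through 12
print(toy)
```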