# Data Manipulation with pandas
Run the hidden code cell below to import the data used in this course.

In [1]:
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")

## Exploring a DataFrame

In [2]:
print(avocado.head())

         date          type  year  avg_price   size     nb_sold
0  2015-12-27  conventional  2015       0.95  small  9626901.09
1  2015-12-20  conventional  2015       0.98  small  8710021.76
2  2015-12-13  conventional  2015       0.93  small  9855053.66
3  2015-12-06  conventional  2015       0.89  small  9405464.36
4  2015-11-29  conventional  2015       0.99  small  8094803.56


In [3]:
print(avocado.info())
# Shows the column types and missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1014 entries, 0 to 1013
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date       1014 non-null   object 
 1   type       1014 non-null   object 
 2   year       1014 non-null   int64  
 3   avg_price  1014 non-null   float64
 4   size       1014 non-null   object 
 5   nb_sold    1014 non-null   float64
dtypes: float64(2), int64(1), object(3)
memory usage: 47.7+ KB
None


In [4]:
# This pulls up a tuple of the number of rows and columns
print(avocado.shape)

(1014, 6)


In [5]:
# Summary Stats
print(avocado.describe())

              year    avg_price       nb_sold
count  1014.000000  1014.000000  1.014000e+03
mean   2016.147929     1.319024  4.167774e+06
std       0.940380     0.295168  5.596185e+06
min    2015.000000     0.760000  8.343000e+02
25%    2015.000000     1.040000  1.320755e+05
50%    2016.000000     1.325000  4.232327e+05
75%    2017.000000     1.540000  1.019066e+07
max    2018.000000     2.090000  2.274362e+07


A DataFrame consists of three components, each accessible through an attribute: the values (.values), the column labels (.columns), and the row labels (.index).

In [6]:
print(avocado.values)

[['2015-12-27' 'conventional' 2015 0.95 'small' 9626901.09]
 ['2015-12-20' 'conventional' 2015 0.98 'small' 8710021.76]
 ['2015-12-13' 'conventional' 2015 0.93 'small' 9855053.66]
 ...
 ['2018-01-21' 'organic' 2018 1.63 'extra_large' 1490.02]
 ['2018-01-14' 'organic' 2018 1.59 'extra_large' 1580.01]
 ['2018-01-07' 'organic' 2018 1.51 'extra_large' 1289.07]]


In [7]:
print(avocado.columns)

Index(['date', 'type', 'year', 'avg_price', 'size', 'nb_sold'], dtype='object')


In [8]:
print(avocado.index)

RangeIndex(start=0, stop=1014, step=1)


## Sorting and Subsetting

In [9]:
# Sort by one column by passing its name; add ascending=False for descending order.
# Pass a list to sort by several columns: rows are ordered by the first column,
# and ties are broken by the next. ascending can likewise take a list.
print(avocado.sort_values(['date', 'type'], ascending=[True, False]))

            date          type  year  avg_price         size      nb_sold
220   2015-01-04       organic  2015       1.46        small    233286.13
558   2015-01-04       organic  2015       1.46        large    216611.20
896   2015-01-04       organic  2015       1.46  extra_large      4370.99
51    2015-01-04  conventional  2015       0.95        small  12357161.34
389   2015-01-04  conventional  2015       0.95        large  13624083.05
...          ...           ...   ...        ...          ...          ...
664   2018-03-25       organic  2018       1.55        large    342853.10
1002  2018-03-25       organic  2018       1.55  extra_large      1070.24
157   2018-03-25  conventional  2018       1.03        small  14130799.10
495   2018-03-25  conventional  2018       1.03        large  12125711.42
833   2018-03-25  conventional  2018       1.03  extra_large    758801.12

[1014 rows x 6 columns]
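
As a minimal sketch of multi-column sorting, the toy DataFrame below (made-up values standing in for the avocado data) is sorted by date ascending, with ties broken by type descending:

```python
import pandas as pd

# Toy data standing in for the avocado dataset (values are made up)
toy = pd.DataFrame({
    "date": ["2015-01-11", "2015-01-04", "2015-01-04", "2015-01-11"],
    "type": ["conventional", "conventional", "organic", "organic"],
    "avg_price": [0.98, 0.95, 1.46, 1.50],
})

# Sort by date (ascending); within each date, sort by type (descending)
toy_sorted = toy.sort_values(["date", "type"], ascending=[True, False])
print(toy_sorted)
```

The original row labels travel with the rows, which is why the index of the sorted output is no longer in order.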


When subsetting multiple columns, we need 2 sets of brackets. The outer brackets are for subsetting the DataFrame, and the inner brackets make up the list of column names to subset. 

In [10]:
print(avocado[["size", 'nb_sold']])
print(avocado['nb_sold'])

             size     nb_sold
0           small  9626901.09
1           small  8710021.76
2           small  9855053.66
3           small  9405464.36
4           small  8094803.56
...           ...         ...
1009  extra_large     1703.52
1010  extra_large     1270.61
1011  extra_large     1490.02
1012  extra_large     1580.01
1013  extra_large     1289.07

[1014 rows x 2 columns]
0       9626901.09
1       8710021.76
2       9855053.66
3       9405464.36
4       8094803.56
           ...    
1009       1703.52
1010       1270.61
1011       1490.02
1012       1580.01
1013       1289.07
Name: nb_sold, Length: 1014, dtype: float64


## Subsetting Rows
There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators to produce True or False for each row, then pass that Boolean result inside square brackets.

In [11]:
avocado[avocado["nb_sold"] > 5000]
avocado[avocado["size"] == "small"]

           date          type  year  avg_price   size     nb_sold
0    2015-12-27  conventional  2015       0.95  small  9626901.09
1    2015-12-20  conventional  2015       0.98  small  8710021.76
2    2015-12-13  conventional  2015       0.93  small  9855053.66
3    2015-12-06  conventional  2015       0.89  small  9405464.36
4    2015-11-29  conventional  2015       0.99  small  8094803.56
..          ...           ...   ...        ...    ...         ...
333  2018-02-04       organic  2018       1.53  small   117922.52
334  2018-01-28       organic  2018       1.61  small   118616.17
335  2018-01-21       organic  2018       1.63  small   108705.28
336  2018-01-14       organic  2018       1.59  small   145680.62


You can filter on multiple conditions at once by combining them with the "bitwise and" operator, &, or the "bitwise or" operator, |. Wrap each condition in parentheses.

In [12]:
avocado[(avocado["nb_sold"] > 5000) & (avocado["size"] == "small")]

           date          type  year  avg_price   size     nb_sold
0    2015-12-27  conventional  2015       0.95  small  9626901.09
1    2015-12-20  conventional  2015       0.98  small  8710021.76
2    2015-12-13  conventional  2015       0.93  small  9855053.66
3    2015-12-06  conventional  2015       0.89  small  9405464.36
4    2015-11-29  conventional  2015       0.99  small  8094803.56
..          ...           ...   ...        ...    ...         ...
333  2018-02-04       organic  2018       1.53  small   117922.52
334  2018-01-28       organic  2018       1.61  small   118616.17
335  2018-01-21       organic  2018       1.63  small   108705.28
336  2018-01-14       organic  2018       1.59  small   145680.62
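
To see how & and | differ, here is a small sketch on made-up data (the column names mirror the avocado dataset, the values do not):

```python
import pandas as pd

toy = pd.DataFrame({
    "size": ["small", "small", "large", "extra_large"],
    "nb_sold": [9000.0, 1200.0, 7000.0, 300.0],
})

# Each comparison is wrapped in parentheses because & and | bind more
# tightly than > and ==
both = toy[(toy["nb_sold"] > 5000) & (toy["size"] == "small")]    # both conditions hold
either = toy[(toy["nb_sold"] > 5000) | (toy["size"] == "small")]  # at least one holds

print(both)
print(either)
```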


### Subsetting using .isin()
This is used for subsetting multiple values of a categorical variable. It takes a list of values to filter on. 

In [13]:
is_big = avocado['size'].isin(['large', 'extra_large'])
print(avocado[is_big])

            date          type  year  avg_price         size      nb_sold
338   2015-12-27  conventional  2015       0.95        large  10197890.05
339   2015-12-20  conventional  2015       0.98        large   9329861.85
340   2015-12-13  conventional  2015       0.93        large  10805838.91
341   2015-12-06  conventional  2015       0.89        large  12160838.62
342   2015-11-29  conventional  2015       0.99        large   9003178.41
...          ...           ...   ...        ...          ...          ...
1009  2018-02-04       organic  2018       1.53  extra_large      1703.52
1010  2018-01-28       organic  2018       1.61  extra_large      1270.61
1011  2018-01-21       organic  2018       1.63  extra_large      1490.02
1012  2018-01-14       organic  2018       1.59  extra_large      1580.01
1013  2018-01-07       organic  2018       1.51  extra_large      1289.07

[676 rows x 6 columns]


| is the bitwise "or" operator and & is the bitwise "and" operator.

Subsetting data based on a categorical variable often involves using the "or" operator (|) to select rows from multiple categories.
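
A quick sketch, on made-up data, suggesting that chaining == comparisons with | and calling .isin() select the same rows:

```python
import pandas as pd

toy = pd.DataFrame({"size": ["small", "large", "extra_large", "small"]})

# Two ways to keep rows whose size is large or extra_large
with_or = toy[(toy["size"] == "large") | (toy["size"] == "extra_large")]
with_isin = toy[toy["size"].isin(["large", "extra_large"])]

print(with_or.equals(with_isin))
```

.isin() tends to stay readable as the number of categories grows, since you only extend the list.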

## Summary Stats

In [14]:
# Calculate quantiles with .quantile(), e.g. the quartiles of nb_sold
avocado['nb_sold'].quantile([0.25, 0.5, 0.75])

The .agg() method allows you to compute custom summary stats. You can pass a list of functions into .agg() to find multiple summary stats.

In [15]:
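def pct30(column):
    # 30th percentile of the column
    return column.quantile(0.3)

def pct40(column):
    # 40th percentile of the column
    return column.quantile(0.4)

# One custom statistic on one column
avocado['nb_sold'].agg(pct30)
# Several custom statistics on several columns
avocado[['nb_sold', 'avg_price']].agg([pct30, pct40])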

          nb_sold  avg_price
pct30  183915.730       1.07
pct40  272741.826       1.20
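
The sketch below, on a made-up Series, shows .quantile() on its own and a custom function passed to .agg(), alone and mixed with built-in statistic names:

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="avg_price")

def pct30(column):
    # 30th percentile, using pandas' default linear interpolation
    return column.quantile(0.3)

print(prices.quantile(0.5))                # the median
print(prices.agg(pct30))                   # one custom statistic
print(prices.agg(["min", pct30, "max"]))   # built-in names mixed with a custom function
```

In the last result, each row is labelled with the statistic's name; a custom function is labelled with its function name.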


The .cumsum() method calculates the cumulative (running) sum down the rows of a column, so each row holds the total of all rows up to and including it.

In [16]:
avocado['nb_sold'].cumsum()

0       9.626901e+06
1       1.833692e+07
2       2.819198e+07
3       3.759744e+07
4       4.569224e+07
            ...     
1009    4.226117e+09
1010    4.226119e+09
1011    4.226120e+09
1012    4.226122e+09
1013    4.226123e+09
Name: nb_sold, Length: 1014, dtype: float64

.cummin() is the cumulative minimum and .cummax() the cumulative maximum; .cumprod() is the cumulative product.
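
A small sketch comparing the four cumulative methods side by side on a made-up Series:

```python
import pandas as pd

sold = pd.Series([3, 1, 4, 1, 5])

running = pd.DataFrame({
    "sold": sold,
    "cumsum": sold.cumsum(),    # running total
    "cummin": sold.cummin(),    # smallest value seen so far
    "cummax": sold.cummax(),    # largest value seen so far
    "cumprod": sold.cumprod(),  # running product
})
print(running)
```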

## Counting
You can use the .drop_duplicates() method to return a DataFrame with duplicate rows removed. Passing a column name or a list of column names to the subset argument controls which columns are compared: in a DataFrame of dogs, subset='name' would keep only one of the dogs named Max, while subset=['name', 'breed'] would keep both Max the Lab and Max the Chihuahua.

In [17]:
avocado.drop_duplicates(subset=["date", 'type'])
dates_reported = avocado.drop_duplicates(subset='date')
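
The dog example from the text can be sketched as follows; the dogs DataFrame is hypothetical and exists only for illustration:

```python
import pandas as pd

# Hypothetical dog data mirroring the example in the text
dogs = pd.DataFrame({
    "name": ["Max", "Max", "Max", "Bella"],
    "breed": ["Labrador", "Chihuahua", "Labrador", "Poodle"],
})

by_name = dogs.drop_duplicates(subset="name")                   # one row per name
by_name_breed = dogs.drop_duplicates(subset=["name", "breed"])  # one row per name/breed pair

print(by_name)
print(by_name_breed)
```

By default the first occurrence of each duplicate is the one that is kept.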

To count the number of rows in each category, use .value_counts(). The sort argument (True by default) orders the results from most to least frequent, and normalize=True returns proportions of the total instead of counts.

In [18]:
avocado['size'].value_counts(sort=True, normalize=True)

small          0.333333
large          0.333333
extra_large    0.333333
Name: size, dtype: float64

## Grouped Summary Stats
The .groupby() method lets you group by one variable and compute a summary statistic of another. You can also use the .agg() method to get multiple statistics at once, and pass a list to .groupby() to group by multiple variables.

In [19]:
avocado.groupby('type')['avg_price'].mean()
avocado.groupby(['type', 'size'])['avg_price'].agg(['min', 'max'])  # string names avoid the deprecation warning newer pandas raises for the built-in min/max

                           min   max
type         size
conventional extra_large  0.76  1.65
             large        0.76  1.65
             small        0.76  1.65
organic      extra_large  1.00  2.09
             large        1.00  2.09
             small        1.00  2.09
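
For a result that is easy to check by hand, here is a sketch of the same grouping pattern on four made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["conventional", "conventional", "organic", "organic"],
    "size": ["small", "large", "small", "large"],
    "avg_price": [0.95, 1.05, 1.46, 1.60],
})

# One statistic per group
mean_by_type = toy.groupby("type")["avg_price"].mean()

# Several statistics per group, named as strings
stats_by_type = toy.groupby("type")["avg_price"].agg(["min", "max"])

print(mean_by_type)
print(stats_by_type)
```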


## Pivot Tables
A pivot table is another way to compute grouped summary statistics, equivalent to .groupby() followed by an aggregation. By default .pivot_table() computes the mean; pass a different function, or a list of functions, to the aggfunc argument. To group by a second variable, use the columns argument. Combinations with no data show as NaN; the fill_value argument sets the value to use in their place. Setting margins=True adds an "All" row and column holding the summary statistic over every row or column (the overall mean here, not a cumulative one).

In [20]:
# String names for aggfunc avoid the deprecation warning newer pandas raises for np.mean/np.median
avocado.pivot_table(values='avg_price', index='type', columns='size',
                    aggfunc=['mean', 'median'], fill_value=0, margins=True)

                     mean                                     median
size          extra_large     large     small       All  extra_large  large  small    All
type
conventional     1.092012  1.092012  1.092012  1.092012        1.040  1.040  1.040  1.040
organic          1.546036  1.546036  1.546036  1.546036        1.530  1.530  1.530  1.530
All              1.319024  1.319024  1.319024  1.319024        1.325  1.325  1.325  1.325
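
Because the avocado data has no empty combinations, fill_value has no visible effect above. The sketch below uses made-up data with one missing combination (no organic/large row) to show fill_value and margins at work:

```python
import pandas as pd

# Toy data with one missing combination: there is no organic/large row
toy = pd.DataFrame({
    "type": ["conventional", "conventional", "organic"],
    "size": ["small", "large", "small"],
    "avg_price": [1.0, 2.0, 3.0],
})

pivot = toy.pivot_table(
    values="avg_price", index="type", columns="size",
    aggfunc="mean", fill_value=0, margins=True,
)
print(pivot)
```

The organic/large cell is filled with 0, while the "All" row and column are computed from the underlying rows rather than from the filled cells.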


## Explicit Indexes

You can reset the index to the default integer labels with .reset_index(), which moves the old index back into a regular column. Passing drop=True discards the old index entirely instead of keeping it as a column.

When subsetting a multi-level index with .loc, you can subset the outer level with a list of labels, but to subset on inner levels you pass a list of tuples, like [('Lab', 'Brown'), ('Chihuahua', 'Tan')]. A row is returned only if it matches every value in one of the tuples.

In [21]:
temperatures_ind = temperatures.set_index('city')  # Use city as the index
temperatures_ind_reset = temperatures_ind.reset_index()  # Move city back to a regular column
temperatures_drop_city = temperatures_ind.reset_index(drop=True)  # Reset the index and discard city
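
A minimal sketch, on a made-up two-row DataFrame, of what each call does to the columns:

```python
import pandas as pd

toy = pd.DataFrame({"city": ["Moscow", "Rome"], "avg_temp_c": [-7.3, 8.1]})

indexed = toy.set_index("city")           # city becomes the row labels
restored = indexed.reset_index()          # city is a regular column again
dropped = indexed.reset_index(drop=True)  # city is discarded entirely

print(list(indexed.columns))
print(list(restored.columns))
print(list(dropped.columns))
```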

In [22]:
# Subsetting with .loc
cities = ['Moscow', 'Saint Petersburg']
temperatures[temperatures['city'].isin(cities)]  # Subsets on the city column, not the index
print(temperatures_ind.loc[cities])  # .loc subsets on the index, so use the DataFrame indexed by city

                        date country  avg_temp_c
city                                            
Moscow            2000-01-01  Russia      -7.313
Moscow            2000-02-01  Russia      -3.551
Moscow            2000-03-01  Russia      -1.661
Moscow            2000-04-01  Russia      10.096
Moscow            2000-05-01  Russia      10.357
...                      ...     ...         ...
Saint Petersburg  2013-05-01  Russia      12.355
Saint Petersburg  2013-06-01  Russia      17.185
Saint Petersburg  2013-07-01  Russia      17.234
Saint Petersburg  2013-08-01  Russia      17.153
Saint Petersburg  2013-09-01  Russia         NaN

[330 rows x 3 columns]


Using two or more columns as the index creates a multi-level, or hierarchical, index. The inner level is implied to be nested inside the outer level (for instance, with breed and color in a DataFrame of dogs, breed would be the outer level and color the inner).

In [23]:
temperatures_ind = temperatures.set_index(['country', 'city']) # City would be nested inside country
rows_to_keep = [('Brazil', 'Rio De Janeiro'), ('Pakistan', 'Lahore')] # A list of (country, city) tuples
print(temperatures_ind.loc[rows_to_keep])

                               date  avg_temp_c
country  city                                  
Brazil   Rio De Janeiro  2000-01-01      25.974
         Rio De Janeiro  2000-02-01      26.699
         Rio De Janeiro  2000-03-01      26.270
         Rio De Janeiro  2000-04-01      25.750
         Rio De Janeiro  2000-05-01      24.356
...                             ...         ...
Pakistan Lahore          2013-05-01      33.457
         Lahore          2013-06-01      34.456
         Lahore          2013-07-01      33.279
         Lahore          2013-08-01      31.511
         Lahore          2013-09-01         NaN

[330 rows x 2 columns]
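
The difference between subsetting the outer level with a list and subsetting both levels with tuples can be sketched on a hypothetical dogs DataFrame (data made up for illustration):

```python
import pandas as pd

dogs = pd.DataFrame({
    "breed": ["Lab", "Lab", "Chihuahua", "Chihuahua"],
    "color": ["Brown", "Black", "Tan", "Brown"],
    "height_cm": [56, 59, 18, 20],
}).set_index(["breed", "color"]).sort_index()

outer = dogs.loc[["Lab"]]                                   # every Lab, whatever the color
pairs = dogs.loc[[("Lab", "Brown"), ("Chihuahua", "Tan")]]  # only these exact breed/color pairs

print(outer)
print(pairs)
```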


While using .sort_index(), you can control the sorting by passing lists to the level and ascending arguments

In [24]:
temperatures_ind.sort_index()
temperatures_ind.sort_index(level='city') # sorts on the city index
print(temperatures_ind.sort_index(level=['country', 'city'], ascending=[True, False])) # sorts on country ascending, then city descending

                          date  avg_temp_c
country     city                          
Afghanistan Kabul   2000-01-01       3.326
            Kabul   2000-02-01       3.454
            Kabul   2000-03-01       9.612
            Kabul   2000-04-01      17.925
            Kabul   2000-05-01      24.658
...                        ...         ...
Zimbabwe    Harare  2013-05-01      18.298
            Harare  2013-06-01      17.020
            Harare  2013-07-01      16.299
            Harare  2013-08-01      19.232
            Harare  2013-09-01         NaN

[16500 rows x 2 columns]


## Slicing and Subsetting with .loc and .iloc

You can slice outer levels easily with .loc, but to slice the inner levels, you need to pass tuples on either side of the colon

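.iloc subsets by integer position rather than by label: pass row and column numbers. Unlike .loc, the end of an .iloc slice is not included.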
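
As a small sketch of the difference, the made-up DataFrame below is sliced once by label with .loc and once by position with .iloc, selecting the same rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "city": ["Harare", "Kabul", "Lahore", "Moscow"],
    "avg_temp_c": [18.3, 3.3, 33.5, -7.3],
}).set_index("city")

# .loc slices by label and includes the end label
by_label = toy.loc["Kabul":"Lahore"]

# .iloc slices by integer position and excludes the end position
by_position = toy.iloc[1:3, 0:1]

print(by_label)
print(by_position)
```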

In [25]:
# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()

# Subset rows from Pakistan to Russia
print(temperatures_srt.loc['Pakistan':'Russia'])

# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[("Pakistan", "Lahore"):('Russia', 'Moscow')]) # subsetting an inner level, so have to pass as tuples

                                 date  avg_temp_c
country  city                                    
Pakistan Faisalabad        2000-01-01      12.792
         Faisalabad        2000-02-01      14.339
         Faisalabad        2000-03-01      20.309
         Faisalabad        2000-04-01      29.072
         Faisalabad        2000-05-01      34.845
...                               ...         ...
Russia   Saint Petersburg  2013-05-01      12.355
         Saint Petersburg  2013-06-01      17.185
         Saint Petersburg  2013-07-01      17.234
         Saint Petersburg  2013-08-01      17.153
         Saint Petersburg  2013-09-01         NaN

[1155 rows x 2 columns]
                       date  avg_temp_c
country  city                          
Pakistan Lahore  2000-01-01      12.792
         Lahore  2000-02-01      14.339
         Lahore  2000-03-01      20.309
         Lahore  2000-04-01      29.072
         Lahore  2000-05-01      34.845
...                     ...         ...
Russi

You can also slice columns with .loc by passing the column slice as a second argument, after a comma.

In [26]:
# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[("India", "Hyderabad"):('Iraq', 'Baghdad')])

# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:, 'date':'avg_temp_c'])

# Subset in both directions at once, pretty much combining both slices above
print(temperatures_srt.loc[("India", "Hyderabad"):('Iraq', 'Baghdad'), 'date':'avg_temp_c'])

                         date  avg_temp_c
country city                             
India   Hyderabad  2000-01-01      23.779
        Hyderabad  2000-02-01      25.826
        Hyderabad  2000-03-01      28.821
        Hyderabad  2000-04-01      32.698
        Hyderabad  2000-05-01      32.438
...                       ...         ...
Iraq    Baghdad    2013-05-01      28.673
        Baghdad    2013-06-01      33.803
        Baghdad    2013-07-01      36.392
        Baghdad    2013-08-01      35.463
        Baghdad    2013-09-01         NaN

[2145 rows x 2 columns]
                          date  avg_temp_c
country     city                          
Afghanistan Kabul   2000-01-01       3.326
            Kabul   2000-02-01       3.454
            Kabul   2000-03-01       9.612
            Kabul   2000-04-01      17.925
            Kabul   2000-05-01      24.658
...                        ...         ...
Zimbabwe    Harare  2013-05-01      18.298
            Harare  2013-06-01      17.020

When slicing a sorted datetime index with .loc, you can pass partial date strings instead of full dates: a year ("2016") or a year and month ("2016-08") covers everything in that period.

In [27]:
# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
temperatures_bool = temperatures[(temperatures['date'] >= '2010-01-01') & (temperatures['date'] <= '2011-12-31')]
print(temperatures_bool)

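# Set date as the index and sort the index
# (date is read from the CSV as strings; partial-date slicing needs real datetimes, so parse it first)
temperatures_ind = temperatures.assign(date=pd.to_datetime(temperatures['date'])).set_index('date').sort_index()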

# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
print(temperatures_ind.loc['2010':'2011'])

# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc['2010-08':'2011-02'])

             date     city        country  avg_temp_c
120    2010-01-01  Abidjan  Côte D'Ivoire      28.270
121    2010-02-01  Abidjan  Côte D'Ivoire      29.262
122    2010-03-01  Abidjan  Côte D'Ivoire      29.596
123    2010-04-01  Abidjan  Côte D'Ivoire      29.068
124    2010-05-01  Abidjan  Côte D'Ivoire      28.258
...           ...      ...            ...         ...
16474  2011-08-01     Xian          China      23.069
16475  2011-09-01     Xian          China      16.775
16476  2011-10-01     Xian          China      12.587
16477  2011-11-01     Xian          China       7.543
16478  2011-12-01     Xian          China      -0.490

[2400 rows x 4 columns]
                  city    country  avg_temp_c
date                                         
2010-01-01  Faisalabad   Pakistan      11.810
2010-01-01   Melbourne  Australia      20.016
2010-01-01   Chongqing      China       7.921
2010-01-01   São Paulo     Brazil      23.738
2010-01-01   Guangzhou      China      14.136
...  
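
A self-contained sketch of partial-date slicing on made-up data; it assumes the date column starts out as strings, as it does after pd.read_csv() without date parsing:

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["2009-12-01", "2010-03-01", "2010-09-01", "2011-01-01", "2012-02-01"],
    "avg_temp_c": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Parse the strings into datetimes, then index and sort by date
toy["date"] = pd.to_datetime(toy["date"])
toy_ind = toy.set_index("date").sort_index()

in_2010_2011 = toy_ind.loc["2010":"2011"]      # all of 2010 and all of 2011
aug_to_feb = toy_ind.loc["2010-08":"2011-02"]  # Aug 2010 through Feb 2011

print(in_2010_2011)
print(aug_to_feb)
```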

## Pivot Tables

Call the .pivot_table() method on a DataFrame. The first argument (values) is the column to aggregate, index= sets the variable shown in rows, and columns= sets the variable shown in columns. The default aggregation function is the mean.

Methods that compute summary statistics on a DataFrame, such as .mean(), take an axis argument. The default, axis='index', computes the statistic down the rows, giving one value per column. To get one value per row (that is, across the columns), set axis='columns'. For most DataFrames this rarely makes sense because the columns hold different data types, but it is useful for pivot tables, where every column holds the same kind of value.

In [None]:
# The date column was read from CSV as strings, so convert it to datetime first;
# the .dt accessor only works on datetime-like values
temperatures['date'] = pd.to_datetime(temperatures['date'])

# Add a year column to temperatures
temperatures['year'] = temperatures['date'].dt.year

# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table('avg_temp_c', index=['country', 'city'], columns='year')

# See the result
print(temp_by_country_city_vs_year)

PivotTables are DataFrames with sorted indexes, so you can use .loc[] and slicing methods for subsetting PivotTables

In [None]:
# Subset for Egypt to India
temp_by_country_city_vs_year.loc["Egypt":"India"]

# Subset for Egypt, Cairo to India, Delhi
temp_by_country_city_vs_year.loc[('Egypt', 'Cairo'):('India', 'Delhi')]

# Subset for Egypt, Cairo to India, Delhi, and 2005 to 2010
# (the year columns are integers, so slice with integer labels)
temp_by_country_city_vs_year.loc[('Egypt', 'Cairo'):('India', 'Delhi'), 2005:2010]

In [None]:
# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean(axis='index') # finds the mean within columns

# Filter for the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])

# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis='columns') # finds the mean within rows

# Filter for the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])
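
The effect of the axis argument is easier to see on a tiny pivot-style table. This sketch uses made-up temperatures with cities as rows and years as columns:

```python
import pandas as pd

# Toy pivot-style table: rows are cities, columns are years
temps = pd.DataFrame(
    {2010: [10.0, 20.0], 2011: [12.0, 24.0]},
    index=["Oslo", "Cairo"],
)

by_year = temps.mean(axis="index")    # one mean per column (year)
by_city = temps.mean(axis="columns")  # one mean per row (city)

print(by_year)
print(by_city)
print(by_city.idxmin())               # label of the row with the lowest mean
```

.idxmin() and .idxmax() return the label directly, an alternative to filtering with == as in the cell above.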

## Visualizing Data

To create a bar plot, first use .groupby() to group by the variable of interest, select the column of values, and compute the mean or another summary statistic. Then call .plot(kind="bar") on the result, followed by plt.show().

In [3]:
# Look at the first few rows of data
print(avocado.head())

# Get the total number of avocados sold of each size
nb_sold_by_size = avocado.groupby("size")["nb_sold"].sum()

# Create a bar plot of the number of avocados sold by size
nb_sold_by_size.plot(kind="bar")

# Show the plot
plt.show()

For a line plot of a DataFrame, set the x= and y= arguments in .plot() and set kind='line'. A Series needs neither: its index is plotted on the x-axis and its values on the y-axis. You can rotate the x-axis labels with rot=; rot=45 shows them at a 45-degree angle.

In [4]:
# Get the total number of avocados sold on each date
nb_sold_by_date = avocado.groupby('date')['nb_sold'].sum()

# Create a line plot of the number of avocados sold by date
# (nb_sold_by_date is a Series, so its index supplies the x-axis; x= and y= are not needed)
nb_sold_by_date.plot(kind='line', rot=45)

# Show the plot
plt.show()

For scatter, set kind='scatter'

In [5]:
# Scatter plot of avg_price vs. nb_sold with title
avocado.plot(x="nb_sold", y="avg_price", kind="scatter", title="Number of avocados sold vs. average price")

# Show the plot
plt.show()

Plots can be layered by subsetting each group of interest on its own line and calling the plotting method on each before calling plt.show(). To add a legend, call plt.legend() with a list of labels, in the order the plots were drawn, before plt.show(). To change the transparency of a histogram, set the alpha= argument of .hist(), where 0 is invisible and 1 is completely opaque. Don't forget the bins= argument of .hist() to set the number of bins.

In [6]:
# The first bracket subsets rows by avocado type; the second selects the avg_price column
avocado[avocado["type"] == "conventional"]["avg_price"].hist(alpha=0.5, bins=20)
avocado[avocado["type"] == "organic"]["avg_price"].hist(alpha=0.5, bins=20)

# Add a legend
plt.legend(["conventional", "organic"])

# Show the plot
plt.show()

## Missing Values

NaN = Not a number

Use the .isna() method to get a DataFrame of Booleans, one per value, that are True wherever a value is missing.

Using .isna().any() gives a more concise report: one Boolean per column indicating whether it contains any missing values. .isna().sum() counts the Trues, giving the number of missing values in each column, and .isna().sum().plot(kind='bar') shows those counts as a bar chart.

.fillna() will replace the NaN with a particular value. .fillna(0) will replace each NaN with 0

In [None]:
# Check individual values for missing values
print(avocados_2016.isna())

# Check each column for missing values
print(avocados_2016.isna().any())

# Bar plot of missing values by variable
avocados_2016.isna().sum().plot(kind="bar")

# Show plot
plt.show()

.dropna() will remove the rows that have NaN values

In [None]:
# Remove rows with missing values
avocados_complete = avocados_2016.dropna()

# Check if any columns contain missing values
print(avocados_complete.isna().any())
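
The missing-value methods in this section can be tried on a small made-up DataFrame; the column names echo the avocado data, but the values are invented:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "small_sold": [100.0, np.nan, 300.0],
    "large_sold": [np.nan, np.nan, 50.0],
})

print(toy.isna().any())  # does each column contain any NaN?
print(toy.isna().sum())  # how many NaNs per column?

filled = toy.fillna(0)   # replace every NaN with 0
dropped = toy.dropna()   # keep only rows with no NaN at all
print(filled)
print(dropped)
```

Both .fillna() and .dropna() return new DataFrames and leave the original unchanged.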

## Creating DataFrames

DataFrames can be created in many ways; two common ones are from a list of dictionaries and from a dictionary of lists.

In both cases, pass the object to pd.DataFrame() to convert it into a DataFrame.

When creating from a list of dictionaries, the DataFrame is built row by row. The list is enclosed in square brackets and contains multiple dictionaries, each holding the contents of one row. Within each dictionary, the key is the column name and the value is that row's entry for the column, separated by a colon.

In [3]:
# Create a list of dictionaries with new data
avocados_list = [
    {"date": "2019-11-03", "small_sold": 10376832, "large_sold": 7835071},
    {"date": "2019-11-10", "small_sold": 10717154, "large_sold": 8561348},
]

# Convert list into DataFrame
avocados_2019 = pd.DataFrame(avocados_list)

# Print the new DataFrame
print(avocados_2019)

When creating from a dictionary of lists, the DataFrame is built column by column. The dictionary is enclosed in curly brackets; each key is a column name, and its value is the list of that column's values from top to bottom, separated from the key by a colon.

In [4]:
# Create a dictionary of lists with new data
avocados_dict = {
  "date": ["2019-11-17", "2019-12-01"],
  "small_sold": [10859987, 9291631],
  "large_sold": [7674135, 6238096]
}

# Convert dictionary into DataFrame
avocados_2019 = pd.DataFrame(avocados_dict)

# Print the new DataFrame
print(avocados_2019)
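
To check that the two construction styles describe the same table, this sketch builds one DataFrame each way from the same values and compares them:

```python
import pandas as pd

# Row by row: a list of dictionaries
rows = [
    {"date": "2019-11-03", "small_sold": 10376832},
    {"date": "2019-11-10", "small_sold": 10717154},
]

# Column by column: a dictionary of lists
columns = {
    "date": ["2019-11-03", "2019-11-10"],
    "small_sold": [10376832, 10717154],
}

from_rows = pd.DataFrame(rows)
from_columns = pd.DataFrame(columns)
print(from_rows.equals(from_columns))
```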

## Reading and Writing Datasets

pd.read_csv("file_path.csv") will convert .csv into a DataFrame. Pass the file path as a string as the argument

In [None]:
# Read CSV as DataFrame called airline_bumping
airline_bumping = pd.read_csv("airline_bumping.csv")

You can manipulate the DataFrame to add new columns, etc.

To convert the DF to .csv, use df.to_csv('New_file_path.csv'). Pass the new file path as a string as the argument

In [None]:
# Create airline_totals_sorted
airline_totals_sorted = airline_totals.sort_values("bumps_per_10k", ascending=False)

# Print airline_totals_sorted
print(airline_totals_sorted)

# Save as airline_totals_sorted.csv
airline_totals_sorted.to_csv("airline_totals_sorted.csv")
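
A round trip can be sketched without touching the disk by writing to an in-memory buffer. The DataFrame and its column names here are made up; io.StringIO stands in for a file path:

```python
import io
import pandas as pd

toy = pd.DataFrame({"airline": ["A", "B"], "nb_bumped": [15, 7]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False leaves the row labels out of the file
buffer.seek(0)                   # rewind so the buffer can be read from the start

round_trip = pd.read_csv(buffer)
print(round_trip)
```

Without index=False, .to_csv() also writes the row labels, which come back as an extra unnamed column when the file is read again.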

## Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Print the highest weekly sales for each `department` in the `walmart` DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this [video](https://campus.datacamp.com/courses/data-manipulation-with-pandas/aggregating-dataframes?ex=1).
- What was the total `nb_sold` of organic avocados in 2017 in the `avocado` DataFrame? If you're stuck, try reviewing this [video](https://campus.datacamp.com/courses/data-manipulation-with-pandas/slicing-and-indexing-dataframes?ex=6).
- Create a bar plot of the total number of homeless people by region in the `homelessness` DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this [video](https://campus.datacamp.com/courses/data-manipulation-with-pandas/creating-and-visualizing-dataframes?ex=1).
- Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this [video](https://campus.datacamp.com/courses/data-manipulation-with-pandas/creating-and-visualizing-dataframes?ex=1).