# Module 3: From `datascience` to `pandas`: II

In [None]:
!pip install datascience
import pandas as pd 
from datascience import *
import numpy as np



In this notebook, we will be working with the `cones` table, a very small table about ice cream flavors used simply for pedagogical purposes. 
The table is loaded using both `datascience` and `pandas` below.

In [None]:
cones_tbl = Table.read_table('cones.csv')
cones_df = pd.read_csv('cones.csv')

## Inserting New Columns

First, let's take a look at the `cones` dataframe in `pandas` to see what we are working with. 

In [None]:
cones_df

Unnamed: 0,Flavor,Color,Price
0,strawberry,pink,3.55
1,chocolate,light brown,4.75
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25
5,bubblegum,pink,4.75


Suppose a friend tells us which type of cone each ice cream comes with. 
We want to add this information as a new column in our data. 

In [None]:
type_of_cone = ["Waffle", "Sugar", "Sugar", "Waffle", "Waffle", "Sugar"]

We can add the new data as a column to the `cones` table by using the `tbl.with_columns()` function in the `datascience` package.


In [None]:
cones_tbl = cones_tbl.with_columns("Type of Cone", type_of_cone)
cones_tbl

Flavor,Color,Price,Type of Cone
strawberry,pink,3.55,Waffle
chocolate,light brown,4.75,Sugar
chocolate,dark brown,5.25,Sugar
strawberry,pink,5.25,Waffle
chocolate,dark brown,5.25,Waffle
bubblegum,pink,4.75,Sugar


There are many ways to add a new column to a dataframe in `pandas` — the easiest way is shown below. 
The string inside the bracket denotes the new column name, and you assign it to the list containing the new column. 
 

In [None]:
cones_df["Type of Cone"] = type_of_cone
cones_df

Unnamed: 0,Flavor,Color,Price,Type of Cone
0,strawberry,pink,3.55,Waffle
1,chocolate,light brown,4.75,Sugar
2,chocolate,dark brown,5.25,Sugar
3,strawberry,pink,5.25,Waffle
4,chocolate,dark brown,5.25,Waffle
5,bubblegum,pink,4.75,Sugar


Another way to add a new column is `df.insert(index_at, col_name, new_col_data)`.
Here the `index_at` argument is the position (counting from 0) at which the new column is inserted. 

If you are interested in familiarizing yourself with more methods to add columns into existing data frames, 
check out [this website](https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/).

In [None]:
cones_df.insert(2, "Type of Cone2", type_of_cone) 
cones_df

Unnamed: 0,Flavor,Color,Type of Cone2,Price,Type of Cone
0,strawberry,pink,Waffle,3.55,Waffle
1,chocolate,light brown,Sugar,4.75,Sugar
2,chocolate,dark brown,Sugar,5.25,Sugar
3,strawberry,pink,Waffle,5.25,Waffle
4,chocolate,dark brown,Waffle,5.25,Waffle
5,bubblegum,pink,Sugar,4.75,Sugar
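
The difference between adding a column in place and getting a new dataframe back can be sketched as follows. This is a minimal illustration on a small made-up table (not `cones.csv`); the column name `Cone` and the variable names are assumptions chosen for the example. `df.assign` is another standard `pandas` method that returns a copy with the extra column.

```python
import pandas as pd

# Hypothetical stand-in for the cones data (three rows only)
cones = pd.DataFrame({
    "Flavor": ["strawberry", "chocolate", "chocolate"],
    "Price": [3.55, 4.75, 5.25],
})
cone_types = ["Waffle", "Sugar", "Sugar"]

# assign returns a new dataframe and leaves `cones` untouched
with_type = cones.assign(Cone=cone_types)

# insert modifies the dataframe it is called on; position 0 is the first column
inserted = cones.copy()
inserted.insert(0, "Cone", cone_types)

print(list(cones.columns))      # ['Flavor', 'Price']
print(list(with_type.columns))  # ['Flavor', 'Price', 'Cone']
print(list(inserted.columns))   # ['Cone', 'Flavor', 'Price']
```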


## Dropping Columns

Sometimes a table will contain information that we are not interested in. 
In that case, we may want to drop the irrelevant columns so that they do not clutter up the table.
With `datascience`, the function to drop a column is `tbl.drop(col_name)`.


In [None]:
cones_tbl.drop("Type of Cone")

Flavor,Color,Price
strawberry,pink,3.55
chocolate,light brown,4.75
chocolate,dark brown,5.25
strawberry,pink,5.25
chocolate,dark brown,5.25
bubblegum,pink,4.75


One thing to note is that the drop function in `datascience`, like most other functions, does not actually change the original table.

In [None]:
# the cones_tbl is unchanged
cones_tbl

Flavor,Color,Price,Type of Cone
strawberry,pink,3.55,Waffle
chocolate,light brown,4.75,Sugar
chocolate,dark brown,5.25,Sugar
strawberry,pink,5.25,Waffle
chocolate,dark brown,5.25,Waffle
bubblegum,pink,4.75,Sugar


To make changes to the original variable, we need to reassign the new table returned by `tbl.drop` to the original name.

In [None]:
# now the cones_tbl is changed
cones_tbl = cones_tbl.drop("Type of Cone")
cones_tbl

Flavor,Color,Price
strawberry,pink,3.55
chocolate,light brown,4.75
chocolate,dark brown,5.25
strawberry,pink,5.25
chocolate,dark brown,5.25
bubblegum,pink,4.75


To do the same procedure in `pandas`, we use [`df.drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html). 
However, one main difference is that we would need to specify that we want to drop columns, since `df.drop` can also drop rows. 

In [None]:
# the drop function without any other arguments does not change the original dataframe
cones_df.drop(columns=['Type of Cone'])

Unnamed: 0,Flavor,Color,Type of Cone2,Price
0,strawberry,pink,Waffle,3.55
1,chocolate,light brown,Sugar,4.75
2,chocolate,dark brown,Sugar,5.25
3,strawberry,pink,Waffle,5.25
4,chocolate,dark brown,Waffle,5.25
5,bubblegum,pink,Sugar,4.75


Note that, similar to its relative in `datascience`, the `drop` function in pandas does not change the original dataframe.

In [None]:
cones_df

Unnamed: 0,Flavor,Color,Type of Cone2,Price,Type of Cone
0,strawberry,pink,Waffle,3.55,Waffle
1,chocolate,light brown,Sugar,4.75,Sugar
2,chocolate,dark brown,Sugar,5.25,Sugar
3,strawberry,pink,Waffle,5.25,Waffle
4,chocolate,dark brown,Waffle,5.25,Waffle
5,bubblegum,pink,Sugar,4.75,Sugar


Instead, to make changes to the original dataframe, we can specify `inplace = True`. 
In the code below, we drop 2 columns, `Type of Cone` and `Type of Cone2`, in the dataframe.

In [None]:
cones_df.drop(columns=['Type of Cone', 'Type of Cone2'], inplace = True)
cones_df

Unnamed: 0,Flavor,Color,Price
0,strawberry,pink,3.55
1,chocolate,light brown,4.75
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25
5,bubblegum,pink,4.75
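
The two ways of keeping the result of `df.drop` can be compared side by side. The sketch below uses a small made-up dataframe, and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical two-row table standing in for cones_df
df = pd.DataFrame({
    "Flavor": ["strawberry", "bubblegum"],
    "Color": ["pink", "pink"],
    "Price": [3.55, 4.75],
})

# Option 1: keep the returned dataframe under a new (or the same) name
without_color = df.drop(columns=["Color"])

# Option 2: modify a dataframe directly; with inplace=True, drop returns None
modified = df.copy()
returned = modified.drop(columns=["Color"], inplace=True)

print(list(df.columns))             # ['Flavor', 'Color', 'Price']
print(list(without_color.columns))  # ['Flavor', 'Price']
print(list(modified.columns))       # ['Flavor', 'Price']
print(returned)                     # None
```

Because `inplace=True` returns `None`, writing `df = df.drop(..., inplace=True)` would overwrite the dataframe with `None`; use one style or the other, not both.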


#### Drop a row by index

The [`df.drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) function can also drop rows by their index. To do this, we just need to specify the index of the rows we want to drop.

In [None]:
# drops the rows with index 0 and 1
cones_df.drop([0, 1])

Unnamed: 0,Flavor,Color,Price
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25
5,bubblegum,pink,4.75


## Renaming Columns 

If the column names read in from our dataset are not descriptive, we may want to change them.
To do this in `datascience`, we use `tbl.relabel(old_name, new_name)`. 
Note that this operation in `datascience` is in place and modifies the original table.

In [None]:
cones_tbl.relabel("Price", "Price per Cone")
cones_tbl

Flavor,Color,Price per Cone
strawberry,pink,3.55
chocolate,light brown,4.75
chocolate,dark brown,5.25
strawberry,pink,5.25
chocolate,dark brown,5.25
bubblegum,pink,4.75


The function to do this with `pandas` is `df.rename`. Since you can also change the index of the rows with `df.rename`, 
we would also need to specify that we are changing the name of the columns. 

We do this by passing in a [dictionary](https://www.geeksforgeeks.org/python-dictionary/) to the `columns` parameter, where each key is an old column name and the corresponding value is the new column name.

If you are interested in knowing more about `df.rename` and other ways to use it, you can check out the [documentation page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html).

In [None]:
# rename the column "Price" by modifying the original dataframe
cones_df.rename(columns={'Price':'Price of one Cone'}, inplace=True)
cones_df

Unnamed: 0,Flavor,Color,Price of one Cone
0,strawberry,pink,3.55
1,chocolate,light brown,4.75
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25
5,bubblegum,pink,4.75


With `pandas`, we can also specify the names of all columns with a single line of code. This is especially useful when the dataset
you read in doesn't come with specified column names. 

In [None]:
# change all column labels of cones_df
cones_df.columns = ["Flavor", "Color", "Price per Cone"]
cones_df

Unnamed: 0,Flavor,Color,Price per Cone
0,strawberry,pink,3.55
1,chocolate,light brown,4.75
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25
5,bubblegum,pink,4.75


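With the same `df.columns` attribute, we can list the current column labels. It is also possible to change a single label by position with `df.columns.values[col_index] = new_col_name`, but this writes to the underlying array directly and may not be picked up reliably, so `df.rename` is the safer choice for renaming one column.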


In [None]:
# lists all column labels of cones_df
cones_df.columns

Index(['Flavor', 'Color', 'Price per Cone'], dtype='object')
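
To rename a single column by its position, one option is to look the old label up in `df.columns` and hand it to `df.rename`. This is a sketch on a made-up dataframe; the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical table standing in for cones_df
df = pd.DataFrame({
    "Flavor": ["strawberry", "chocolate"],
    "Color": ["pink", "dark brown"],
    "Price": [3.55, 5.25],
})

# Look up the label at position 2, then rename it through df.rename
old_label = df.columns[2]
renamed = df.rename(columns={old_label: "Price per Cone"})

print(list(renamed.columns))  # ['Flavor', 'Color', 'Price per Cone']
print(list(df.columns))       # ['Flavor', 'Color', 'Price']
```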

## Sorting Columns 

What if we want to know the flavor of the most expensive ice-cream? In this case we would like to sort the table by the "Price per Cone" column.
To do this in `datascience`, we need to use `tbl.sort(col_name)`. The default sorting order in `tbl.sort(col_name)` is `descending = False`,
returning a table where the lowest value is on the top. 

In [None]:
# sorts the table by "Price per Cone", with the default descending = False (ascending)
cones_tbl.sort("Price per Cone")

Flavor,Color,Price per Cone
strawberry,pink,3.55
chocolate,light brown,4.75
bubblegum,pink,4.75
chocolate,dark brown,5.25
strawberry,pink,5.25
chocolate,dark brown,5.25


Notice that the `tbl.sort` function does not operate in place and does not make changes to the original table. 

In [None]:
cones_tbl

Flavor,Color,Price per Cone
strawberry,pink,3.55
chocolate,light brown,4.75
chocolate,dark brown,5.25
strawberry,pink,5.25
chocolate,dark brown,5.25
bubblegum,pink,4.75


In [None]:
# to keep the sorted table, we assign it to a name (assign it back to cones_tbl to replace the original)
cones_tbl_sorted = cones_tbl.sort("Price per Cone")
cones_tbl_sorted

Flavor,Color,Price per Cone
strawberry,pink,3.55
chocolate,light brown,4.75
bubblegum,pink,4.75
chocolate,dark brown,5.25
strawberry,pink,5.25
chocolate,dark brown,5.25


[`df.sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) is the function in `pandas` that sorts the dataframe by a specified column. 
Both functions sort in ascending order by default. The main difference is how the order is specified: `tbl.sort` takes `descending = True/False`, 
while `df.sort_values` takes `ascending = True/False`. By default, `pandas` sets `ascending=True`.

In [None]:
# sorts cones_df in ascending order of 'Price per Cone'
cones_df.sort_values("Price per Cone")

Unnamed: 0,Flavor,Color,Price per Cone
0,strawberry,pink,3.55
1,chocolate,light brown,4.75
5,bubblegum,pink,4.75
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25


In [None]:
# sorts cones_df in descending order of 'Price per Cone'
cones_df.sort_values("Price per Cone", ascending = False)

Unnamed: 0,Flavor,Color,Price per Cone
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25
1,chocolate,light brown,4.75
5,bubblegum,pink,4.75
0,strawberry,pink,3.55


Unlike `tbl.sort(col_name)`, but like `df.drop`, `df.sort_values` also accepts an `inplace` argument to edit the original dataframe. 

In [None]:
# modify cones_df by sorting it in descending order of 'Price per Cone'
cones_df.sort_values("Price per Cone", ascending = False, inplace = True)
cones_df

Unnamed: 0,Flavor,Color,Price per Cone
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25
1,chocolate,light brown,4.75
5,bubblegum,pink,4.75
0,strawberry,pink,3.55


We can also sort a dataframe by its index using [`sort_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html). The function works much like `sort_values`, but does not require us to pass in a column since we are sorting the index.

In [None]:
cones_df.sort_index(inplace=True)
cones_df

Unnamed: 0,Flavor,Color,Price per Cone
0,strawberry,pink,3.55
1,chocolate,light brown,4.75
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25
5,bubblegum,pink,4.75
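
The interplay between `sort_values` and `sort_index` can be seen on a small example. The sketch below uses a made-up three-row table with distinct prices, so the sorted order is unambiguous; the variable names are assumptions.

```python
import pandas as pd

# Hypothetical three-row table with distinct prices
df = pd.DataFrame({
    "Flavor": ["strawberry", "chocolate", "bubblegum"],
    "Price per Cone": [3.55, 5.25, 4.75],
})

# Sorting returns a new dataframe whose rows keep their original index labels
by_price = df.sort_values("Price per Cone", ascending=False)
print(list(by_price.index))   # [1, 2, 0]

# Sorting by index puts the rows back in their original order
restored = by_price.sort_index()
print(list(restored.index))   # [0, 1, 2]
```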


## Merging Dataframes 

Like the `join` function in `datascience`, we use [`merge`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) in `pandas` to join two dataframes based on some common characteristics. 

Here is the `cones_df`. Notice that the `Flavor` column has three unique values: 'strawberry', 'chocolate', and 'bubblegum'.

In [None]:
cones_df

Unnamed: 0,Flavor,Color,Price per Cone
0,strawberry,pink,3.55
1,chocolate,light brown,4.75
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25
5,bubblegum,pink,4.75


We will create a new dataframe called `ratings`. It has a column called `Kind` with the values 'strawberry', 'chocolate' and 'vanilla'.

In [None]:
ratings = pd.DataFrame({
    'Kind': ['strawberry', 'chocolate', 'vanilla'],
    'Stars': [2.5, 3.5, 4],
    'No. of Reviewers': [10, 15, 12]}
)
ratings

Unnamed: 0,Kind,Stars,No. of Reviewers
0,strawberry,2.5,10
1,chocolate,3.5,15
2,vanilla,4.0,12


`cones_df` and `ratings` have common values of 'strawberry' and 'chocolate'. 
It might be useful to have a dataframe with both the price information from `cones_df` and rating information from `ratings`. 
Consequently, we are going to join these two dataframes based on the common values in the `Flavor` and `Kind` columns of `cones_df` and `ratings` respectively. 

The type of [merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) we are going to do is called an 'inner join'; this is the most common type of merge and the only one we'll cover (you can learn more about different types of joins [here](https://www.analyticsvidhya.com/blog/2020/02/joins-in-pandas-master-the-different-types-of-joins-in-python/)).
Essentially, an 'inner join' only includes the rows that have common values in both of the dataframes.

The syntax of a merge is as follows: `left_df.merge(right_df, how = 'type of join', left_on = 'column(s) of left df', right_on = 'column(s) of right df')`.

In [None]:
# join cones_df with ratings
cones_df.merge(ratings, how = 'inner', left_on = 'Flavor', right_on = 'Kind')

Unnamed: 0,Flavor,Color,Price per Cone,Kind,Stars,No. of Reviewers
0,strawberry,pink,3.55,strawberry,2.5,10
1,strawberry,pink,5.25,strawberry,2.5,10
2,chocolate,light brown,4.75,chocolate,3.5,15
3,chocolate,dark brown,5.25,chocolate,3.5,15
4,chocolate,dark brown,5.25,chocolate,3.5,15


After we merge the dataframes, the result contains all the columns of both dataframes. 
For every row with `strawberry` in `cones_df`, the merge attaches the matching `strawberry` row from the `ratings` dataframe. 
Since there are two rows that have `strawberry` in `cones_df`, we get two rows for `strawberry` in the merged dataframe; the values from the `ratings` dataframe are simply repeated.

Because this is an inner join, the row corresponding to `bubblegum` in `cones_df` is not included in the merged table, as there is no `bubblegum` value in `ratings`. 
Similarly, the row with `vanilla` in `ratings` is not included as `vanilla` does not feature in `cones_df`.

Like `drop` and `sort_values` without `inplace`, `merge` creates a new dataframe and does not change either of the original two dataframes.
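
A compact version of the inner join described above might look like this. It is a sketch on shortened, made-up copies of the two tables; dropping the duplicate `Kind` column afterwards is an optional clean-up step, not something the merge requires.

```python
import pandas as pd

# Shortened, hypothetical versions of cones_df and ratings
cones = pd.DataFrame({
    "Flavor": ["strawberry", "chocolate", "chocolate", "bubblegum"],
    "Price per Cone": [3.55, 4.75, 5.25, 4.75],
})
ratings = pd.DataFrame({
    "Kind": ["strawberry", "chocolate", "vanilla"],
    "Stars": [2.5, 3.5, 4.0],
})

# Inner join: keep only rows whose Flavor also appears in Kind
merged = cones.merge(ratings, how="inner", left_on="Flavor", right_on="Kind")

# Flavor and Kind hold the same values after an inner join, so one can go
merged = merged.drop(columns=["Kind"])

print(merged)
print(len(cones), len(ratings))  # the inputs are unchanged: 4 3
```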

## Applying Functions to Columns

In the `datascience` library, you can use the `apply` function to transform the data of a column from our table via a function.
In the example below, we define a function `double` that doubles any input, and will consequently double the price of each ice-cream.

In [None]:
# double the price of each cone
def double(x):
    return x * 2

cones_tbl.apply(double, "Price per Cone")

array([ 7.1,  9.5, 10.5, 10.5, 10.5,  9.5])

In `pandas`, rather than using the `apply` method, we use the [`map`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) method on
a series. Since `map` is called on the series itself, we only need to pass in the function.

In [None]:
doubled_price_per_cone = cones_df["Price per Cone"].map(double)
doubled_price_per_cone

0     7.1
1     9.5
2    10.5
3    10.5
4    10.5
5     9.5
Name: Price per Cone, dtype: float64
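
The result of `map` is a new series, which can be stored as a column if we want to keep it. The sketch below uses a made-up series, and the new column name is an assumption. For simple arithmetic like doubling, multiplying the series directly gives the same values.

```python
import pandas as pd

def double(x):
    return x * 2

# Hypothetical table standing in for cones_df
df = pd.DataFrame({"Price per Cone": [3.55, 4.75, 5.25]})

# map calls double once per element and returns a new series
doubled = df["Price per Cone"].map(double)

# Arithmetic on the whole series gives the same values without a function
vectorized = df["Price per Cone"] * 2

# Store the result under a new (hypothetical) column name
df["Doubled Price"] = doubled
print(df)
```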

## (Optional Section): Lambda Functions 

You should have already learned about functions, but now let's take that a bit further. 
Below is a classic example of a function: the `adder`. This function takes in two parameters, `a`
and `b`, and adds them together.

In [None]:
# returns the sum of the 2 input arguments
def adder(a, b):
    return a + b

adder(2, 3)

However, for such a simple function, the code above is longer than it needs to be. 
This is where the lambda function comes in: below is the same `adder` function, but instead of a formal function definition, 
we compress the `adder` into just one line.

In [None]:
# set lambda_adder to be a function that behaves just like the adder function
lambda_adder = lambda a, b: a + b  
lambda_adder(2, 3)

This might seem rather unnecessary, and indeed lambda functions are typically not intended to be used like this.
Below is an example of how they are supposed to be used in `pandas`. If a function or method requires a function as input
and that function is simple, we can use a lambda rather than a formal function definition.

In [None]:
# we use a lambda function instead of the double function above
doubled_price_per_cone = cones_df["Price per Cone"].map(lambda x: x * 2)
doubled_price_per_cone
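
The same idea applies outside `pandas`: any built-in that takes a function as an argument can take a lambda. The sketch below uses plain python and a made-up list of (flavor, price) pairs.

```python
# Hypothetical (flavor, price) pairs
cones = [("strawberry", 3.55), ("chocolate", 5.25), ("bubblegum", 4.75)]

# sorted takes a key function; the lambda picks out the price of each pair
by_price = sorted(cones, key=lambda cone: cone[1])
print(by_price[0])   # ('strawberry', 3.55)

# max takes a key function too
most_expensive = max(cones, key=lambda cone: cone[1])
print(most_expensive)  # ('chocolate', 5.25)
```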

## GroupBy Objects and Aggregation Functions 

Suppose we wanted to find the average price for each ice cream flavor: in `datascience`, we can achieve this by calling the `group` method and specifying `np.mean` as the aggregation function.

In [None]:
cones_tbl.select('Flavor', 'Price per Cone').group('Flavor', np.mean)

Grouping in `pandas` is done with the [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method.
This process is slightly more involved because calling `groupby` returns a "groupby" object, which we can then use to aggregate on a column. 

In [None]:
# A groupby object is returned
cones_df[['Flavor', 'Price per Cone']].groupby('Flavor')

Next, we aggregate on the "groupby" object by using the `agg` function and passing in the aggregation function.

In [None]:
# We specify an aggregation function by calling .agg() on the resulting groupby object

cones_df[['Flavor', 'Price per Cone']].groupby('Flavor').agg(np.mean)

Note that the column we group by automatically becomes the index of the new dataframe. We can prevent this by setting `as_index=False` in our call to `groupby`.

In [None]:
cones_df[['Flavor', 'Price per Cone']].groupby('Flavor', as_index=False).agg(np.mean)

More generally, grouping with `datascience` takes the form:

`tbl.group(column, func)`

The equivalent expression in `pandas` is:

`df.groupby(column, as_index=False).agg(func)`
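
As a worked instance of the `pandas` form, the sketch below groups a made-up copy of the price data by flavor. It passes the aggregation by name (`"mean"`), which `agg` also accepts; that choice, and the variable names, are assumptions for this example.

```python
import pandas as pd

# Hypothetical copy of the Flavor and Price per Cone columns
df = pd.DataFrame({
    "Flavor": ["strawberry", "chocolate", "chocolate",
               "strawberry", "chocolate", "bubblegum"],
    "Price per Cone": [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})

# One row per flavor; as_index=False keeps Flavor as a regular column
means = df.groupby("Flavor", as_index=False).agg("mean")
print(means)
```

The groups come back sorted by the grouping column, so the rows are bubblegum, chocolate, strawberry.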

## Aggregating multiple columns

In the examples above, we purposely removed the "Color" column before grouping. What happens if we try to aggregate both the "Color" and "Price per Cone" columns?

In [None]:
# datascience
cones_tbl.group('Flavor', np.mean)

In [None]:
# pandas
cones_df.groupby('Flavor', as_index=False).agg(np.mean)

Since the "Color" column consists of strings and it doesn't make sense to find the mean value of strings, the aggregation is not successful for that column. 
The two libraries handle this differently: `datascience` keeps the column but leaves it blank, while `pandas` versions before 2.0 drop the column entirely (`pandas` 2.0 and later raise an error instead of silently dropping it).
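
Because the handling of non-numeric columns depends on the `pandas` version, it can help to say explicitly which columns to aggregate. The sketch below shows two ways of doing so on a made-up table; the variable names are assumptions.

```python
import pandas as pd

# Hypothetical copy of cones_df
df = pd.DataFrame({
    "Flavor": ["strawberry", "chocolate", "chocolate", "bubblegum"],
    "Color": ["pink", "light brown", "dark brown", "pink"],
    "Price per Cone": [3.55, 4.75, 5.25, 4.75],
})

# Option 1: ask mean to skip non-numeric columns
numeric_means = df.groupby("Flavor", as_index=False).mean(numeric_only=True)

# Option 2: select the column to aggregate before calling mean
selected_means = df.groupby("Flavor", as_index=False)["Price per Cone"].mean()

print(numeric_means)
print(selected_means)
```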

A limitation of the `datascience` package is that we can only specify one aggregation function. 
In `pandas`, we can pass in multiple aggregation functions as a list to [`agg`](https://pandas.pydata.org/pandas-docs/version/0.23.1/generated/pandas.core.groupby.DataFrameGroupBy.agg.html).

In [None]:
# aggregate "Price per Cone" with both sum and mean functions
cones_df.groupby('Flavor', as_index=False).agg([np.sum, np.mean])

We can also specify an aggregation for each column by passing in a dictionary. 

Here, `' '.join` is a function that will concatenate an array of strings and put a space in between each string.

In [None]:
# aggregate the Color column by joining the strings, and aggregate the Price per Cone column by taking the mean
cones_df.groupby('Flavor', as_index=False).agg({'Color': ' '.join, 'Price per Cone': np.mean})
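
To see what the dictionary form produces, here is the same kind of call on a made-up four-row table. The mean is passed by name (`"mean"`) here, which is an assumption of this sketch rather than the notebook's own choice.

```python
import pandas as pd

# Hypothetical copy of cones_df
df = pd.DataFrame({
    "Flavor": ["strawberry", "chocolate", "chocolate", "bubblegum"],
    "Color": ["pink", "light brown", "dark brown", "pink"],
    "Price per Cone": [3.55, 4.75, 5.25, 4.75],
})

# Each key names a column; each value is the aggregation for that column
summary = df.groupby("Flavor", as_index=False).agg(
    {"Color": " ".join, "Price per Cone": "mean"}
)

chocolate = summary.set_index("Flavor").loc["chocolate"]
print(chocolate["Color"])           # light brown dark brown
print(chocolate["Price per Cone"])  # 5.0
```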

We can also write our own aggregation function and pass that in. Suppose we wanted to find half the mean price of each ice cream flavor.

In [None]:
# aggregate all relevant columns using the half_mean function defined below
def half_mean(array):
    return np.mean(array)/2

cones_df.groupby('Flavor', as_index=False).agg(half_mean)

Lastly, we can omit `agg` entirely for some of the more common aggregation functions like sum, mean, max, and min.

In [None]:
# aggregate all relevant columns by taking the sum
cones_df.groupby('Flavor', as_index=False).sum()

In [None]:
# aggregate all relevant columns by taking the mean
cones_df.groupby('Flavor', as_index=False).mean()

In [None]:
# aggregate all relevant columns by taking the max
cones_df.groupby('Flavor', as_index=False).max()

In [None]:
# aggregate all relevant columns by taking the min
cones_df.groupby('Flavor', as_index=False).min()

## Pivot

Pivot tables are useful for breaking up our data into bins in 2 dimensions, along 2 different features.
For example, we can use a pivot table to see how the price of ice cream relates to both Flavor and Color.

Let's try it using the `datascience` package first, with `cones_tbl`.

The syntax for a pivot table in `datascience` is `tbl.pivot(feature_for_columns, feature_for_rows, values = feature_for_values, collect = aggregation_function)`. Here, we look at the average price for each combination of Color and Flavor.

In [None]:
cones_tbl.pivot('Flavor', 'Color', values = 'Price per Cone', collect = np.mean)

Let us try to replicate this table in `pandas`, using `cones_df`.

The syntax for a pivot table is:
```python
pd.pivot_table(df, values = feature_for_values, 
               index = feature_for_rows, columns = feature_for_columns, 
               aggfunc = aggregation_function)
```

Let us try to sum prices.

In [None]:
pd.pivot_table(cones_df, values='Price per Cone', index = 'Flavor', columns = 'Color', aggfunc = sum)

In `datascience`, when a combination is not in the dataset (like bubblegum and dark brown), the pivot table will show a `0`. In `pandas`, the table will show `NaN` (Not a Number). Also, as you can see, in `pandas` the values of Flavor have become the index.

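When no `values` and `collect` arguments are given, `tbl.pivot` in `datascience` counts the rows in each combination. To build the same pivot table of counts in `pandas`, we pass in the `len` function, which returns the length of an array: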

In [None]:
pd.pivot_table(cones_df, values='Price per Cone', index = 'Flavor', columns = 'Color', aggfunc = len)
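
If zeros are preferred over `NaN` for missing combinations, `pd.pivot_table` has a `fill_value` parameter. The sketch below compares the two on a made-up table; the aggregation is passed by name (`"mean"`), and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical copy of cones_df
df = pd.DataFrame({
    "Flavor": ["strawberry", "chocolate", "chocolate", "chocolate", "bubblegum"],
    "Color": ["pink", "light brown", "dark brown", "dark brown", "pink"],
    "Price per Cone": [3.55, 4.75, 5.25, 5.25, 4.75],
})

# Default: combinations that never occur show up as NaN
with_nan = pd.pivot_table(df, values="Price per Cone",
                          index="Flavor", columns="Color", aggfunc="mean")

# fill_value replaces those missing entries with the given value
with_zero = pd.pivot_table(df, values="Price per Cone",
                           index="Flavor", columns="Color", aggfunc="mean",
                           fill_value=0)

print(with_nan)
print(with_zero)
```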

## Lists and List Comprehensions 

Throughout Data 8, we've been working with `numpy` arrays, often created with `make_array` from `datascience`. We use these arrays
because they are very fast at mathematical computations on whole columns of data. However, arrays are not always the best tool:
for some tasks, such as repeatedly appending elements, they can be slower than the more conventional python `list` objects.
Lists behave much like arrays but are built straight into python. 
Below are examples of a `numpy` array and a `list` containing the same information.

In [None]:
# A numpy array created with make_array from datascience
datascience_array = make_array(1, 2, 3, 4, 5)
datascience_array

array([1, 2, 3, 4, 5])

In [None]:
# Like an array but native!
python_list = [1, 2, 3, 4, 5]
python_list

[1, 2, 3, 4, 5]

Just like with a `numpy` array, we can add new elements to a python `list`. Here we append
elements to both a `numpy` array and a python `list`.

In [None]:
# Adding elements to numpy arrays (np.append returns a new array)
datascience_array = np.append(datascience_array, 6)
datascience_array

array([1, 2, 3, 4, 5, 6])

In [None]:
# Adding elements to python lists
python_list.append(6)  # List appending
python_list += [7]  # List concatenation
python_list.extend([8, 9])  # List extensions
python_list

[1, 2, 3, 4, 5, 6, 7, 8, 9]

We can index the python `list` similarly to a `numpy` array. 
But instead of calling `.item`, we use square brackets directly.

In [None]:
datascience_array.item(5)

6

In [None]:
python_list[5]  

6

In `datascience`, if we wanted to create a list or array and repeatedly append elements into it, we would first use `make_array` to create an array then use `np.append` to add to it.

In [None]:
doubled_for_loop_ds = make_array()
for item in datascience_array:
    doubled_for_loop_ds = np.append(doubled_for_loop_ds, item * 2)
doubled_for_loop_ds

array([ 2.,  4.,  6.,  8., 10., 12.])

In python, we can do a very similar procedure:

In [None]:
doubled_for_loop_python = []
for item in python_list:
    doubled_for_loop_python.append(item * 2)

However, there is a much more concise and quicker way of doing the same thing in python, called a list comprehension.
Try to see if you can guess how it works: 

In [None]:
# Doubling values
datascience_array = datascience_array * 2
doubled_list = [2 * item for item in python_list]

print(datascience_array)
print(python_list)
print(doubled_list)

[ 2  4  6  8 10 12]
[1, 2, 3, 4, 5, 6, 7, 8, 9]
[2, 4, 6, 8, 10, 12, 14, 16, 18]


A list comprehension has the following syntax: `[operation for element in list]`. 
In the above case, we multiplied each `element` (which we named item) by 2 to create a doubled list.
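
A list comprehension can also filter elements by adding an `if` clause at the end: `[operation for element in list if condition]`. The sketch below uses a made-up list of prices.

```python
# Hypothetical list of prices
prices = [3.55, 4.75, 5.25, 5.25, 5.25, 4.75]

# Transform every element
doubled = [2 * price for price in prices]

# Keep only the elements that satisfy the condition
expensive = [price for price in prices if price > 5]

print(doubled)
print(expensive)  # [5.25, 5.25, 5.25]
```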

## Dictionaries

A dictionary is one of the most flexible ways to store data in python. It is a collection of key/value pairs in which each value is looked up by its key rather than by its position. The `key` is the identifier that maps to the `value`. 

The syntax of a dictionary is `{key: value, key: value, ...}`. Here are some examples to show the versatility of dictionaries.

In [None]:
# empty dictionary
dictionary = {}
dictionary

In [None]:
# dictionary with one key-value pair
dictionary = {'name': 'Alan'}
dictionary

The `key` in the above example is the string 'name' and the corresponding `value` is 'Alan'.

In [None]:
# dictionary with multiple key value pairs
dictionary = {'name': 'Alan', 7: 3, 'assortment': [6,4,8,'five']}
dictionary

Note the following things in the above example.

1. There are multiple keys: 'name', 7, 'assortment'
2. The keys need not be just strings. The integer 7 serves as a key.
3. We can store different types of values under different keys. 
Here, the value associated with `name` is the string `Alan`. The value associated with `7` is the integer `3`. The value associated with `assortment` is the list `[6,4,8,'five']`.

You can add an item to an existing dictionary like this.

In [None]:
dictionary["new_key"] = "new_value"
dictionary

We have added the key `new_key` with the value `new_value` to the dictionary. We can also remove a key and its value from a dictionary like this.

In [None]:
dictionary.pop("new_key")
dictionary

You can extract the value of a particular key from a dictionary like this: `dictionary[key]`.

In [None]:
dictionary['name']


Note that if you index into an array or a list like `list[1]`, you get the element at index 1. 
But since a dictionary is indexed by key rather than by position, `dictionary[1]` looks for the integer key `1` and not the element at position 1. 

In [None]:
# this gives us the value associated with the integer key 7
dictionary[7]

You can extract all the keys of a dictionary using `dictionary.keys()`. This returns a view of the keys, which can be turned into a list with `list(dictionary.keys())`.

In [None]:
dictionary.keys()

This is helpful in situations such as for loops, where we want to visit every key.

In [None]:
# loop over the keys of our dictionary and print each value
for key in dictionary.keys():
    print(dictionary[key])
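
When both the key and the value are needed inside the loop, `items()` yields them together. The sketch below rebuilds a dictionary like the one used above; `get` is shown as one way to look up a key that may be missing.

```python
dictionary = {"name": "Alan", 7: 3, "assortment": [6, 4, 8, "five"]}

# items() yields each key together with its value
for key, value in dictionary.items():
    print(key, "->", value)

# keys() is a view; wrap it in list() when an actual list is needed
key_list = list(dictionary.keys())
print(key_list)  # ['name', 7, 'assortment']

# get() returns a default instead of raising KeyError for a missing key
missing = dictionary.get("missing", "no such key")
print(missing)   # no such key
```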

A very good use of dictionaries is to build a pandas dataframe from scratch like this.

In [None]:
flowers_dict = {'Number of petals': [8, 34, 5],'Name': ['lotus', 'sunflower', 'rose'],'Color': ['pink', 'yellow', 'red']}
flowers_dict

In [None]:
flowers_df = pd.DataFrame(data = flowers_dict)
flowers_df

The keys of the dictionary automatically become the column headers, and the values associated with each key become the column values.
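
The conversion also works in the other direction, and `pd.DataFrame` accepts a list of per-row dictionaries as well. The sketch below uses made-up flower data; the variable names are assumptions.

```python
import pandas as pd

# Hypothetical data: one dictionary per row
rows = [
    {"Name": "lotus", "Number of petals": 8},
    {"Name": "sunflower", "Number of petals": 34},
]
flowers = pd.DataFrame(rows)

# to_dict("list") goes back to one list of values per column
columns_dict = flowers.to_dict("list")

print(flowers)
print(columns_dict)
```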