# Lab 2 - Introduction to Jupyter Notebooks, Data Structures, and Pandas

This introductory notebook will review concepts that you may already be familiar with from Data 8 or similar courses. The basic strategies and tools for data analysis covered in this notebook will be the foundations of this class. It will cover an overview of our software and some programming concepts.



# **Part 1 - Our Computing Environment, Jupyter notebooks** 
This webpage is called a "Jupyter notebook" (we'll call it "notebook" for short). A notebook is a place to **write programs** and **view their results and output**. 

## Text cells
In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings. 

After you edit a text cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions of the lab.) 

Tip: You can also use a shortcut of **`Shift + Return (enter)`** to "run" this cell.

**Understanding Check 1** This paragraph is in its own text cell.  
Try editing it so that this sentence is the last sentence in the paragraph, and then click the "run cell" ▶| button. This sentence, for example, should be deleted.  So should this one.

## Code cells
Other cells contain code in the Python 3 language. Running a code cell will execute all of the code it contains.

To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, either press ▶| or again, hold down the `shift` key and press `return` or `enter`.

The fundamental building block of Python code is an expression. Cells can contain multiple lines with multiple expressions. 

When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.

Note: When you select a code cell, the cell-type drop-down at the top changes from "Markdown" to "Code".

In [None]:
print("First this line is printed,")
print("and then this one.")
print("Hello World")
print("\N{WAVING HAND SIGN}, \N{EARTH GLOBE ASIA-AUSTRALIA}!")

Don't be scared if you see a "Kernel Restarting" message! Your data and work will still be saved. Once you see "Kernel Ready" in a light blue box on the top right of the notebook, you'll be ready to work again. 

If a restart happens, you should **rerun** any cells with imports, variables, and loaded data.

## Writing Jupyter notebooks
You can use Jupyter notebooks for your own projects or documents.  When you make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the **`+`** button in the menu bar.  It'll start out as a text cell.  You can change it to a code cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Code".

Tip: as a shortcut, you can also select the cell (single click or click on the left of it) and press **`A`** to add a cell above or **`B`** to add a cell below. 

## Errors
Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:

1. The rules are **simple**.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are **rigid**.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

Whenever you write code, you'll make mistakes.  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

**Errors are okay**; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell.  Run it and see what happens.

In [None]:
print("This line is missing something."

The last line of the error output attempts to tell you what went wrong.  The **syntax** of a language is its structure, and this **`SyntaxError`** tells you that you have created an illegal structure.  Depending on your Python version, the message may mention **"`EOF`"**, which means "end of file": Python expected you to write something more (in this case, a right parenthesis) before finishing the cell. Newer Python versions instead report that the `(` was never closed.

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it.  

**Understanding Check 2** Try to fix the code above so that you can run the cell and see the intended message instead of an error.

------------------------

# **Part 2 - Python basics -  Introduction to Programming Concepts and Data** 
Before getting into the more advanced analysis techniques that will be required in this course, we need to cover a few of the foundational elements of programming in Python.

## A. Expressions
The departure point for all programming is the concept of the **expression**. 

An expression is a combination of variables, operators, and other Python elements that the language interprets and acts upon. Expressions act as a set of instructions to be fed through the interpreter, with the goal of generating specific outcomes. See below for some examples of basic expressions.

In [None]:
# Examples of expressions:

#addition
print(2 + 2)

#string concatenation 
print('me' + ' and I')

#you can print a number with a string if you cast it 
print("me" + str(2))

#exponents
print(12 ** 2)


In a notebook, only the value of the last expression in a cell is displayed automatically. To see the values of earlier expressions, you need to call **`print`** on them, as the cell above does. Try removing some of the **`print`** calls above to see which values still display.

## B. Data Types

In Python, all things have a type. In the above example, you saw __*integers*__ (positive and negative whole numbers) and __*strings*__ (sequences of characters, often thought of as words or sentences). 

We denote strings by surrounding the desired value with quotes: 
* For example, "Data Science" and "2017" are strings, while `bears` and `2020` (both without quotes) are not strings (`bears` without quotes would be interpreted as a variable). 

In addition to strings and integers, you'll also be using decimal numbers in Python, which are called __*floats*__ (positive and negative decimal numbers). 

You'll also often run into __*booleans*__. They can take on one of two values: `True` or `False`. Booleans are often used to check conditions; for example, we might have a list of dogs, and we want to sort them into small dogs and large dogs. One way we could accomplish this is to say either `True` or `False` for each dog after seeing if the dog weighs more than 15 pounds. 

We'll soon be going over additional data types. Below is a table that summarizes the information in this section:

|Data Type|Definition|Examples|
|-|-|-|
|Integer|Positive and negative whole numbers|`42`, `-10`, `0`|
|Float|Positive and negative decimal numbers|`73.9`, `2.4`, `0.0`|
|String|Sequence of characters|`"Go Bears!"`, `"variables"`|
|Boolean|True or false value|`True`, `False`|
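
To check which type a value has, you can call Python's built-in `type` function on it. The short sketch below does this for one example value from each row of the table; the variable names are made up for illustration.

```python
# One example value for each data type in the table (names are illustrative).
whole_number = 42
decimal_number = 73.9
text = "Go Bears!"
flag = True

print(type(whole_number))    # <class 'int'>
print(type(decimal_number))  # <class 'float'>
print(type(text))            # <class 'str'>
print(type(flag))            # <class 'bool'>
```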


## C. Variables
In the example below, __`a`__ and __`b`__ are Python objects known as __variables__. 

We are giving an object (in this case, an `integer` and a `float`, two Python data types) a name that we can store for later use. To use that value, we can simply type the name that we stored the value as. 

Variables are stored within the notebook's environment, meaning stored variable values carry over from cell to cell.

In [None]:
# assign values to "a" and "b"
a = 4
b = 10/5

# Notice that "a" retains its value.
print(a)

In [None]:
# add variables "a" and "b"
a + b

### Question 1: Variables
Let's see if we can write a series of expressions that creates two new variables called __x__ and __y__ and assigns them values of __10.5__ and __7.2__. Then assign their product to the variable __combo__ and print it.

In [None]:
# Fill in the missing lines to complete the expressions.


Check to see if the value you get for **combo** is what you expect it to be.

## D. Strings
In Python, a string is a sequence of characters. Strings can be created by enclosing characters inside single quotes or double quotes. 

In [None]:
#All 4 strings are the same 
String = "Hello World" 
String = 'Hello World' 
String = """Hello World""" 
String = "Hello" + " " + "World"
print(String)

In [None]:
first_letter = String[0] # Indexing in Python starts at 0 
print(first_letter)

last_letter = String[-1] # Indexing into the last character 
print(last_letter)

In [None]:
## empty string
string = "" 
string

Note: Operators do not behave the same way on strings as they do on integers.

In [None]:
four = 4
three = 3
four + three

In [None]:
four = "4"
three = "3"
four + three

As you can see, "adding" strings concatenates them.

In [None]:
four = 4
three = "3"
four + three

Why didn't it work? Let's look at the **types** of data we are dealing with:

In [None]:
type(four)

In [None]:
type(three)
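
Since one value is an integer and the other is a string, Python cannot decide whether `+` should add or concatenate. One way to resolve this, sketched below, is to convert one of the values explicitly with the built-in `int` or `str` functions. The names `numeric_sum` and `joined` are illustrative.

```python
four = 4
three = "3"

# Convert the string to an integer to add the two values numerically.
numeric_sum = four + int(three)
print(numeric_sum)  # 7

# Convert the integer to a string to concatenate the two values instead.
joined = str(four) + three
print(joined)       # 43
```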

## E. Lists
The next topic is particularly useful in the kind of data manipulation that you will see throughout this class. 

The following few cells will introduce the concept of __`lists`__ (and their counterpart, __`numpy arrays`__). Read through the following cell to understand the basic structure of a list. 

* A list is an __ordered collection of objects__. They allow us to store and access groups of variables and other objects for easy access and analysis. Check out this [documentation](https://www.tutorialspoint.com/python/python_lists.htm) for an in-depth look at the capabilities of lists.

To **initialize a list**, you use **square brackets**. Putting objects separated by commas in between the brackets will add them to the list. 

In [None]:
# an empty list
lst = []
print(lst)

In [None]:
# reassigning our empty list to a new list
lst = [1, 3, 6, 'lists', 'are', 'fun', 4]
print(lst)

In [None]:
#lists in python are zero-indexed so the indices for lst are 0,1,2,3,4,5 and 6
example = lst[2]
print(example)

In [None]:
#list slicing: this line stores the elements at indices 1 (inclusive) through 4 (exclusive) of lst as a new list 
#called lst_2:
lst_2 = lst[1:4]
lst_2

It is important to note that when you store a list in a variable, you are actually storing a **reference** (sometimes called a pointer) to the list. That means if you assign your list to another variable and then change the elements through that other variable, you are changing the same data as in the original list. 

In [None]:
a = [1,2,3] #original list
b = a #b now points to list a 

In [None]:
# change first element in list "b" to "4"
b[0] = 4 

In [None]:
# print first element of allegedly non-modified list "a"
print(a[0])  

As you can see, even though we modified list `b`, `a[0]` returns 4, because `a` and `b` both point to the same list.
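
If you want a second list that can be changed without affecting the first, you can make a copy instead of assigning the same list to a new name. The sketch below uses the built-in `copy` method of lists; the names `original`, `alias`, and `independent` are illustrative.

```python
original = [1, 2, 3]
alias = original               # a second name for the same list
independent = original.copy()  # a new list with the same elements

alias[0] = 4         # also visible through "original"
independent[1] = 99  # changes only the copy

print(original)      # [4, 2, 3]
print(independent)   # [1, 99, 3]
```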

### Question 2: Lists
Build a list of length 10 containing whatever elements you'd like. Then, slice it into a new list of length five using index slicing. Finally, assign the last element in your sliced list to a variable and print it.

In [None]:
### Fill in the ellipses to complete the question.
...

Lists can also be operated on with a few built-in analysis functions. These include __`min`__ and __`max`__, among others. 

Lists can also be __concatenated__ together. 

Find some examples below.

In [None]:
# A list containing six integers.
a_list = [1, 6, 4, 8, 13, 2]

In [None]:
# Another list containing six integers.
b_list = [4, 5, 2, 14, 9, 11]

In [None]:
print('Max of a_list:', max(a_list))
print('Min of b_list:', min(b_list))

In [None]:
# Concatenate a_list and b_list:
c_list = a_list + b_list
print('Concatenated:', c_list)

## F. Numpy Arrays
Closely related to the concept of a list is the __array__, which is an __ordered sequence of elements that is structurally similar to a list__. 

Arrays, however, support element-wise arithmetic that regular lists do not, and __numpy arrays__ can also be __faster__ than Python lists. This difference is not important for small lists, but it becomes significant as the amount of data grows.

For the purpose of data manipulation, we'll access arrays through [Numpy](https://docs.scipy.org/doc/numpy/reference/routines.html), which will require an __import statement__.

Now run the next cell to import the numpy library into your notebook, and examine how numpy arrays can be used.

In [None]:
import numpy as np

In [None]:
# Initialize an array of integers 0 through 9.
example_array = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
# This can also be accomplished using np.arange
example_array_2 = np.arange(10)
print('Undoubled Array:', example_array_2)

In [None]:
# Double the values in example_array and print the new array.
double_array = example_array * 2
print('Doubled Array:', double_array)

This behavior differs from that of a list. See below what happens if you multiply a list.

In [None]:
example_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
example_list * 2

Notice that instead of multiplying each of the elements by two, multiplying a list and a number returns that many copies of that list. This is the reason that we will sometimes use Numpy over lists. Other mathematical operations have interesting behaviors with lists that you should explore on your own. 
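
As a starting point for that exploration, the sketch below compares `+` on a list with `+` on a Numpy array. The names `small_list` and `small_array` are illustrative and chosen so that they do not overwrite the variables above.

```python
import numpy as np

small_list = [1, 2, 3]
small_array = np.array(small_list)

# Adding two lists concatenates them; adding two arrays adds element by element.
list_sum = small_list + small_list
array_sum = small_array + small_array
print(list_sum)   # [1, 2, 3, 1, 2, 3]
print(array_sum)  # [2 4 6]

# Adding a number works for an array but raises a TypeError for a list.
shifted_array = small_array + 1
print(shifted_array)  # [2 3 4]

try:
    small_list + 1
    list_plus_number_failed = False
except TypeError:
    list_plus_number_failed = True
print(list_plus_number_failed)  # True
```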

## G. Looping
[__`Loops`__](https://www.tutorialspoint.com/python/python_loops.htm) are often useful in manipulating, iterating over, or transforming large lists and arrays. 

The first type we will discuss is the __`for loop`__. For loops are helpful in **traversing a list** and **performing an action on each element**.

For example, the following code moves through every element in `example_array`, adds 5 to it, and appends the result to a new list. 

It's important to note that `element` is an **arbitrary** variable name used to represent whichever value the loop is currently operating on. We can change the variable name to whatever we want and achieve the same result, as long as we stay consistent.

In [None]:
new_list = []

In [None]:
for element in example_array:    # for every element in a list - ie how many times we perform the action below
    new_element = element + 5    # action to perform - add 5
    new_list.append(new_element) # append this "new element" to "new list"

print(new_list)

In [None]:
#iterate using list indices rather than elements themselves
for i in range(len(example_array)):
    example_array[i] = example_array[i] + 5

example_array

## Other types of loops - while loop
The __while loop__ repeatedly performs operations until a conditional is no longer satisfied. A conditional is a [boolean expression](https://en.wikipedia.org/wiki/Boolean_expression), that is an expression that evaluates to `True` or `False`. 

What makes a while loop different from a for loop is that a for loop ends after a fixed number of iterations, whereas a while loop ends as soon as its condition evaluates to `False`. **In practice, for loops are sufficient for most tasks.**

In the example below, an array of integers 0 to 9 is generated. When the program enters the while loop, it finds that the maximum value of the array is less than 50. Because of this, it adds 1 to the fifth element (index 4), as instructed. Once the instructions in the loop body are complete, the program checks the conditional again. Again, the maximum value is less than 50. This process repeats until the fifth element, now the maximum value of the array, is equal to 50, at which point the conditional is no longer true and the loop ends.

In [None]:
while_array = np.arange(10)        # Generate our array of values

print('Before:', while_array)

In [None]:
while(max(while_array) < 50):      # Set our conditional
    while_array[4] += 1            # Add 1 to the fifth element if the conditional is satisfied 
    
print('After:', while_array)

### **Question 3** - For loops
Print every letter of the string below:

In [None]:
string = "Hello World"
string

In [None]:
for letter in string: 
  print(letter)    

Create a new list called `do_people_like_newman`. Append the string " dislikes Newman" to every string in the `seinfeld` list below, and store the new strings in `do_people_like_newman`.

In [None]:
seinfeld = ['Jerry', 'Elaine', 'Kramer', 'Costanza']
seinfeld

In [None]:
do_people_like_newman = []

for character in seinfeld: 
    character = character + " dislikes Newman"
    do_people_like_newman.append(character)
    


In [None]:
do_people_like_newman

## H. Dictionaries
The [__`dictionary`__](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) data structure is a collection of **"key-value"** pairs. Sometimes dictionaries are referred to as "maps" or "associative arrays". 


```
Dictionary = { key1 : value1 , 
               key2 : value2 , 
               key3 : value3 }
```

You might notice that unlike lists, which use `[]`, dictionaries are initialized with `{}` (also known as **curly braces**), and inside these braces we have the **key : value** pairs. Each key-value pair is separated by a comma. Unlike lists, which are indexed by position, dictionaries are indexed by their keys. 


For example, if we wanted to create a dictionary called "states" where:
* the __key__ is the __name of the state__ 
* the __value__ is the __state abbreviation__.


we would use the following code:

In [None]:
states = {"California" : 'CA',
               "Idaho" : 'ID',
              "Nevada" : 'NV'}

print(states)

We can **access** the abbreviations (values) by indexing into the dictionary with brackets and the key value.

Note that in a list, we primarily index based on numbers. Here, we can use the key - which is a string. Thus, dictionaries can be more intuitive.

For example, if you want to return **`VALUE`** associated with **`KEY`**, you would do the following:

```
example_dict[KEY]
```

This would return **`VALUE`**.
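
A few other common dictionary operations are sketched below on a small, made-up `capitals` dictionary (not part of the lab data): looking up a value, adding a new pair, reading a key that may be missing with `get`, and looping over all pairs with `items`.

```python
# Hypothetical dictionary used only for illustration.
capitals = {"California": "Sacramento", "Nevada": "Carson City"}

# Look up a value by its key.
print(capitals["Nevada"])  # Carson City

# Add a new key-value pair.
capitals["Idaho"] = "Boise"

# get() returns a default value instead of raising a KeyError for a missing key.
missing = capitals.get("Oregon", "unknown")
print(missing)  # unknown

# Loop over every key-value pair.
for state, capital in capitals.items():
    print(state, "->", capital)
```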

### **Question 4** - Dictionaries
How would you return the abbreviation for California using the states dictionary above? Assign `result` to this expression.

In [None]:
# Using the states dictionary above, look up the value stored under the key "California"
result = states["California"]
result

Just like an actual dictionary, each key can store multiple objects (for example, in a list), which is what makes dictionaries very useful. Again, unlike with a list, we don't need to know the position of an element; we can just look it up via its key. 

In the example below, the word **park** is used as a key to store a **list** of its definitions.

In [None]:
dictionary = {'parity': 'the quality or state of being equal or equivalent',
              'park' : ["a large public green area in a town, used for recreation" , 
                         "bring (a vehicle that one is driving) to a halt and leave it temporarily"]
              }
             

How would you return the second definition of the word "park" from the dictionary above?

In [None]:
## Write your code below
...

## I. Functions
Functions are useful when you want to repeat a series of steps on multiple different objects, but don't want to type out the steps over and over again. Many functions are built into Python already; for example, you've already made use of `len()` to retrieve the number of elements in a list. You can also write your own functions, and at this point you already have the skills to do so.


Functions generally take a set of __parameters__ (also called inputs), which define the objects they will use when they are run. For example, the `len()` function takes a list or array as its parameter, and returns the length of that list.

Let's look at a function that takes two parameters, compares them somehow, and then returns a boolean value (`True` or `False`) depending on the comparison. The `is_multiple` function below takes as parameters an integer `m` and an integer `n`, checks if `m` is a multiple of `n`, and returns `True` if it is. Otherwise, it returns `False`. 

`if` statements, just like `while` loops, are dependent on boolean expressions. If the conditional is `True`, then the following indented code block will be executed. If the conditional evaluates to `False`, then the code block will be skipped over. Read more about `if` statements [here](https://www.tutorialspoint.com/python/python_if_else.htm).

Below, we will use the modulus operator `%`. 
* Typing `x % y` will return the remainder when `x` is divided by `y`. 
* Therefore, (`x % y == 0`) will return `True` when `y` divides `x` with remainder `0`.
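
A few concrete values may make the operator easier to follow. This is a small sketch; the variable names are illustrative.

```python
# 12 divided by 4 is exactly 3, so the remainder is 0.
remainder_a = 12 % 4
print(remainder_a)  # 0

# 12 divided by 7 is 1 with 5 left over.
remainder_b = 12 % 7
print(remainder_b)  # 5

# 7 divided by 12 is 0 with 7 left over.
remainder_c = 7 % 12
print(remainder_c)  # 7

print(12 % 4 == 0)  # True
```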

In [None]:
def is_multiple(m, n):
    if (m % n == 0): 
        return True
    else:
        return False

In [None]:
is_multiple(12, 4)

In [None]:
is_multiple(12, 7)
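
Because the comparison `m % n == 0` already evaluates to `True` or `False`, the same check can also be written without an `if` statement. The sketch below uses a hypothetical name, `is_multiple_compact`, so that it does not replace the function above.

```python
def is_multiple_compact(m, n):
    # The comparison itself is a boolean, so it can be returned directly.
    return m % n == 0

print(is_multiple_compact(12, 4))  # True
print(is_multiple_compact(12, 7))  # False
```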

## Review
To sum up:

* Variables are names used for storing objects.
* Python has basic data types and structures: `integers`, `strings`, `lists`, `dictionaries`, etc.
* Data can be manipulated at scale using loops.
* Functions can be called to repeat a series of steps. They usually require parameters as inputs. 

-----------------------
# Part 3 - Introduction to Pandas

For this course, we'll be working extensively with the [__`Pandas`__](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html) package. This is an incredibly powerful tool that is used for examining, creating, and manipulating **tabular data**: think Excel spreadsheets with "rows" and "columns."

Let's import pandas below:

In [None]:
import pandas as pd

## Creating a Pandas Dataframe


Let's make our own DataFrame from scratch without having to import data from another file. Let's say we have two arrays (or lists), one with a list of fruits, and another with a list of their prices. Then, we can create a new `DataFrame` with each of these arrays as columns by setting the argument `data` in `pd.DataFrame()` to a **dictionary** of these columns:


In [None]:
fruit_names = ['Apple', 'Orange', 'Banana']
fruit_prices = [1, 0.75, 0.5]

fruit_table = pd.DataFrame(data = {
     "Fruit"    : fruit_names,
     "Price ($)": fruit_prices})

fruit_table

As you can see, the pandas **`DataFrame`** constructor takes a **dictionary** that pairs each column label with its corresponding array (or list), and creates a new DataFrame with each array as a column. To create a new DataFrame with no columns or rows, we simply write:

In [None]:
empty_table = pd.DataFrame()
empty_table

We typically start off with empty tables when we need to add rows inside for loops, which we'll see later.
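
One common way to build a table inside a for loop is sketched below. It is an illustration rather than the exact approach used later in the course: each row is collected as a dictionary in a plain list, and the DataFrame is created once at the end. The names `rows` and `looped_table` are illustrative.

```python
import pandas as pd

fruit_names = ['Apple', 'Orange', 'Banana']
fruit_prices = [1, 0.75, 0.5]

# Collect one dictionary per row, using the column labels as keys.
rows = []
for i in range(len(fruit_names)):
    rows.append({"Fruit": fruit_names[i], "Price ($)": fruit_prices[i]})

# Build the DataFrame from the list of row dictionaries.
looped_table = pd.DataFrame(rows)
print(looped_table)
```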

## Importing data from file using Pandas

We will use the "california_housing_train" data, which contains the median house prices for California Districts derived from the 1990 census.


To create this table, we will read the data from a file called `california_housing_train.csv`, stored in the `data/` folder. In general, to import data from a `.csv` file, we write **`pd.read_csv("file_path_&_name")`**. Values in a `.csv` file are separated by commas, and this is the file format most commonly used with the `pandas` package. 

In [None]:
california_housing = pd.read_csv("data/california_housing_train.csv")


Now that we have loaded our DataFrame, we want to see our new table. Here are a few useful tools for inspecting it (`df` below stands for any DataFrame). 

* We can use the **`head()`** method to see the first 5 rows of our table. Alternatively, we can use **`tail()`** to see the last 5 rows of our table. 
* The **`shape`** attribute returns a tuple where the first item is the number of rows and the second is the number of columns in the table. 

```
df.head()
df.tail()
df.shape
```

* Now that we know what our table looks like and what its shape is, we can use the **`describe()`** method to see basic summary statistics from our table. 

```
df.describe()
```

In [None]:
# Show first 5 rows of our data
california_housing.head()

In [None]:
# Show last 5 rows of our data
california_housing.tail()

In [None]:
california_housing.shape

In [None]:
rows = california_housing.shape[0] 
columns = california_housing.shape[1]
print("The number of rows is: " + str(rows), "  The number of columns is: " + str(columns))

In [None]:
california_housing.describe()

### Renaming columns

To rename columns, we can use the `rename` method for pandas DataFrames. You can find the documentation for this method [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html).

In [None]:
new_names = {
    "longitude": "LONG",
    "latitude": "LAT"}

In [None]:
california = california_housing.rename(columns = new_names) # rename only the columns listed in new_names
california.head()

## Accessing Values

Often, it is useful to access only the rows, columns, or values related to our analysis. We'll look at several ways to cut down our table into smaller, more digestible parts.

Let's say we wanted to grab only certain rows of this DataFrame. We can do this with the **`iloc`** indexer; it takes a list or range of integer positions and creates a new DataFrame containing the rows of the original DataFrame at those positions. Remember that in Python, indices start at 0! Below are a few examples:

In [None]:
california.iloc[[1, 3, 5]] # Takes rows with indices 1, 3, and 5 (the 2nd, 4th, and 6th rows)

In [None]:
california.iloc[[7]] # Takes the row with index 7 (8th row)

In [None]:
california.iloc[np.arange(7)] # Takes the rows with indices 0, 1, ... 6

Similarly, we can also choose to display certain columns of the DataFrame. There are two ways to accomplish this:
- Index the DataFrame with a list of column names (directly, or through `loc`).
- Use the **`drop`** method, which creates a new DataFrame with all columns _except_ those indicated by the parameters (i.e. the parameters are dropped).

Some examples:

In [None]:
california.loc[:, ["housing_median_age", "total_rooms"]].head() # Selects only "housing_median_age" and "total_rooms" columns

In [None]:
california[['median_income','population']] # select only these two columns

In [None]:
california.drop(california.columns[[0, 1]], axis=1).head() # Drops the columns with indices 0 and 1

In [None]:
california.iloc[[1,2,3,5], [3,5]] # Select only the rows with indices 1, 2, 3, 5, 
                                   # and only the columns with indices 3 and 5

If you want to select more than one column at a time and/or a certain set of rows, you can use 


```
df.loc[:, ['column_name', 'column_name']]
```

where the first entry inside the brackets selects the rows (by index label) and the second is the list of columns you want. 

The `:` right after **`.loc[`** is shorthand for **all**. This example gives you all the rows for the two columns. 
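
The sketch below shows this pattern on a small, made-up DataFrame called `toy` (not part of the lab data). It also illustrates one detail worth knowing: a row slice in `loc` includes both endpoints, while the same slice in `iloc` excludes the last one.

```python
import pandas as pd

# Hypothetical table used only for illustration.
toy = pd.DataFrame({"name": ["a", "b", "c", "d"],
                    "x": [1, 2, 3, 4],
                    "y": [10, 20, 30, 40]})

# All rows, two columns.
all_rows = toy.loc[:, ["x", "y"]]

# Rows labeled 0 through 1 (both included), two columns.
by_label = toy.loc[0:1, ["x", "y"]]

# Positions 0 up to (but not including) 1, so only the first row.
by_position = toy.iloc[0:1]

print(all_rows.shape)     # (4, 2)
print(by_label.shape)     # (2, 2)
print(by_position.shape)  # (1, 3)
```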

### Question 6 
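To make sure you understand the `loc`, `iloc`, and `drop` functions, try selecting the columns "households" through "median_house_value" with only the first 3 rows:

In [None]:
# YOUR CODE HERE
# Take the first 3 rows by position, then slice the columns by label
# (assumes "households" comes before "median_house_value" in the table)
california.iloc[:3].loc[:, "households":"median_house_value"]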


Finally, `loc` can also select rows by a condition rather than by position or label. A condition has two parts:
- A column label
- A condition that each row should match

In other words, we select rows like so: `DataFrame_name.loc[DataFrame_name["column_name"] filter]`, where `filter` is a comparison such as `== 90000`.


Here are some examples of selection:

The variable `median_house_value` indicates the median house value in an area. The query below finds all rows (areas) where the house value is exactly 90000.

In [None]:
california.loc[california["median_house_value"] == 90000]

The variable `population` corresponds to the population in an area. With the following filter, we'll find the rows where the population is a whole number from 1 to 99 (i.e. sparsely populated areas).

In [None]:
df_population = california.loc[california["population"].isin(np.arange(1, 100))]
df_population
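
A range condition like this can also be written with comparison operators, combined with `&`, or with the `between` method of a column. The sketch below uses a small, made-up `areas` DataFrame rather than the housing data. Unlike `isin` with a list of integers, both forms would also match non-integer values inside the range.

```python
import pandas as pd

# Hypothetical table used only for illustration.
areas = pd.DataFrame({"population": [50, 150, 99, 1200, 3]})

# Each comparison needs its own parentheses when combined with &.
in_range = areas.loc[(areas["population"] >= 1) & (areas["population"] < 100)]

# between() includes both endpoints by default.
same_rows = areas.loc[areas["population"].between(1, 99)]

print(in_range)
```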

## Sorting

It can be very useful to sort our DataFrames according to some column. The `sort_values` method does exactly that; it takes the column (or list of columns) that you want to sort by. By default, `sort_values` sorts the table in _ascending_ order of the data in the column indicated; however, you can change this by setting the optional parameter `ascending=False`.

Recall that we created a `df_population` dataset above which contained areas with small population. Let's sort the values by the `median_house_value` column to see sparsely populated areas with expensive houses.   

In [None]:
df_population.sort_values(by=['median_house_value'], 
                          ascending = False) # Sort table by median house value in descending order

Example: Only keep rows where population is above the average

In [None]:
above_average = california[california["population"] > np.mean(california["population"])]
above_average.head()

## Removing NAs and Duplicates from the DataFrame
Next we will cover dropping unwanted values and duplicate rows using **`dropna()`** and **`drop_duplicates()`** respectively. Both of these functions return a new DataFrame without changing the original by default. 

In order to store the new table, you will have to **assign it to a variable**. This is the default behavior for most Pandas functions.

Let's go back to our `fruit_table` DataFrame.

In [None]:
fruit_table

In [None]:
fruit_table.loc[4] = np.nan ## insert NA
fruit_table.loc[5] = fruit_table.loc[0] ## insert duplicate

In [None]:
fruit_table

In [None]:
new_fruit_df = fruit_table.drop_duplicates()
new_fruit_df

In [None]:
new_fruit_df = fruit_table.dropna()
new_fruit_df
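
Because each method returns a new DataFrame, the two steps can also be chained to remove duplicates and missing values in one expression. The sketch below uses a small, made-up `demo` table so that it can be run on its own; it also shows that the original table is left unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing row (index 2) and one duplicate row (index 3).
demo = pd.DataFrame({"Fruit": ["Apple", "Orange", np.nan, "Apple"],
                     "Price ($)": [1.0, 0.75, np.nan, 1.0]})

# Drop the duplicate rows first, then the rows with missing values.
cleaned = demo.drop_duplicates().dropna()

print(len(demo))     # 4
print(len(cleaned))  # 2
```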

### Manipulating data in columns
One way to manipulate DataFrame tables is to use **`df['column name']`** to return one column of the table. 

For example, we could take the **housing_median_age** column and use **`value_counts()`** to see how many times each unique value appears. Similarly, you can also apply other functions to columns. 

Adding new columns also uses this **`["column name"]`** syntax. You can specify **`df["column name"]`** and set it equal to the data you want to add. For example, to add a column of names in all upper-case letters, you could write:
```
df["upper_case_names"] = df["names"].str.upper() 
```
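
As a small sketch of both ideas, the example below counts values in a column and then adds a derived column. It uses a made-up `ages` table with a `housing_median_age` column so that it can be run on its own; the new column name `age_in_decades` is illustrative.

```python
import pandas as pd

# Hypothetical table used only for illustration.
ages = pd.DataFrame({"housing_median_age": [15, 52, 15, 30, 52, 15]})

# Count how many times each unique value appears in the column.
age_counts = ages["housing_median_age"].value_counts()
print(age_counts)

# Add a new column computed from an existing one.
ages["age_in_decades"] = ages["housing_median_age"] / 10
print(ages.head())
```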

In [None]:
## this might give a SettingWithCopyWarning because new_fruit_df was derived from fruit_table - but it is more intuitive
new_fruit_df['Fruit'] = new_fruit_df['Fruit'].str.upper()

In [None]:
new_fruit_df

### Question 7
Try lowercasing the Fruit column.

## Summary of pandas functions

As a summary, here are some useful functions that you can use with pandas.
    
|Name|Example|Purpose|
|-|-|-|
|`pd.DataFrame`|`pd.DataFrame()`|Create an empty DataFrame, usually to extend with data|
|`pd.read_csv`|`pd.read_csv("my_data.csv")`|Create a DataFrame from a data file|
|`pd.DataFrame({})`|`df = pd.DataFrame({"N": np.arange(5), "2*N": np.arange(0, 10, 2)})`|Create a DataFrame from a dictionary of columns|
|`loc`|`df.loc[df["N"] > 10]`|Create a copy of a DataFrame with only the rows that match some *predicate*|
|`loc`|`df.loc[:, ["N"]]`|Create a copy of a DataFrame with only specified column names|
|`(subsetting)`|`df[["N"]]`|Another way to create a copy of a DataFrame with only specified column names|
|`iloc`|`df.iloc[np.arange(0, 6, 2)]`|Create a copy of the DataFrame with only the rows whose positions are in the given array|
|`sort_values`|`df.sort_values(["N"])`|Create a copy of a DataFrame sorted by the values in a column|
|`index`|`len(df.index)`|Compute the number of rows in a DataFrame|
|`columns`|`len(df.columns)`|Compute the number of columns in a DataFrame|
|`drop`|`df.drop(columns=["2*N"])`|Create a copy of a DataFrame without some of the columns|
