# Lab 2 - Python and Pandas 

Upon successful completion of this assignment, a student will be able to:

* Format text using Markdown.
* Load a data set, access it, and explore its properties.
* Submit an assignment to Gradescope.

We start with the standard setup for our notebook files, importing the standard modules.

In [1]:
#  Import standard modules  
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

import otter
grader = otter.Notebook()

## Example 1 - More Data Cleaning 
*Adapted from J. Sullivan*

Let's look at another data file to see additional data cleaning steps and code.  

The initial data set reads in part: 

![property data](images/property-data.jpg)

In [2]:
prop = pd.read_csv("data/property.csv")
prop

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3,1,1000
1,100002000.0,197.0,LEXINGTON,N,3,1.5,--
2,100003000.0,,LEXINGTON,N,,1,850
3,100004000.0,201.0,BERKELEY,12,1,,700
4,,203.0,BERKELEY,Y,3,2,1600
5,100006000.0,207.0,BERKELEY,Y,,1,800
6,100007000.0,,WASHINGTON,,2,HURLEY,950
7,100008000.0,213.0,TREMONT,Y,--,1,
8,100009000.0,215.0,TREMONT,Y,na,2,1800


We can see that `pandas` is already able to find some of the different ways that we have missing values in the data.

For instance, in the `ST_NUM` column the 3rd entry is blank and the 7th entry is NaN. `pandas` reads the blank entry as `NaN` as well, so both of these values are found by the `isnull()` method.

In [3]:
prop['ST_NUM'].isnull()

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
8    False
Name: ST_NUM, dtype: bool

However, there are other missing value encodings that pandas does not immediately recognize. 

Let's look at the `NUM_BEDROOMS` column.

![property data 2](images/property-data2.jpg)




In this column, missing values are encoded as "n/a", "NA", "--", and "na".

Let's see what `pandas` automatically recognizes.

In [4]:
prop['NUM_BEDROOMS'].isnull()

0    False
1    False
2     True
3    False
4    False
5     True
6    False
7    False
8    False
Name: NUM_BEDROOMS, dtype: bool

`pandas` automatically recognizes "n/a" and "NA", but not "--" or "na".

Let's change that! 

In [5]:
# Making a list of missing value types
missing_values = ["n/a", "na", "--", "NA"]
prop2 = pd.read_csv("data/property.csv", na_values = missing_values)

In [6]:
print (prop2['NUM_BEDROOMS'])
print (prop2['NUM_BEDROOMS'].isnull())

0    3.0
1    3.0
2    NaN
3    1.0
4    3.0
5    NaN
6    2.0
7    NaN
8    NaN
Name: NUM_BEDROOMS, dtype: float64
0    False
1    False
2     True
3    False
4    False
5     True
6    False
7     True
8     True
Name: NUM_BEDROOMS, dtype: bool
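
As a minimal sketch of the same idea, the snippet below reads a small in-memory CSV twice: once with the defaults and once with extra `na_values`. The CSV text and the `ID` column are made up for illustration and are not taken from `property.csv`.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real data file
csv_text = "ID,NUM_BEDROOMS\n1,3\n2,n/a\n3,--\n4,na\n5,NA\n6,2\n"

# Default behavior: only "n/a" and "NA" are treated as missing
default_read = pd.read_csv(io.StringIO(csv_text))

# Extra encodings are added on top of the defaults
custom_read = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])

default_missing = int(default_read["NUM_BEDROOMS"].isnull().sum())
custom_missing = int(custom_read["NUM_BEDROOMS"].isnull().sum())
print(default_missing, custom_missing)
```

Once every missing-value encoding is recognized, the column can be parsed as numeric instead of being kept as text.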


## Example 2 - Printing


In many courses and tutorials for a new language, the first thing you learn is how to print "Hello World".

In [7]:
print('Hello World')

Hello World


In [8]:
firstName = "Adam, Andrew, and Daniel"

In [9]:
"Hello " + firstName + "!"

'Hello Adam, Andrew, and Daniel!'

Apply the built-in function `dir()` to the variable `firstName` above and print the outcome.

https://docs.python.org/3/library/functions.html#dir

In [10]:
dir(firstName)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']


This lists all the attributes and methods available on the string `firstName`.
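
The full listing is long because it includes the double-underscore ("dunder") names. A small sketch, using a hypothetical string `sample`, of how one might keep only the everyday method names:

```python
sample = "Adam"  # hypothetical string, used only for illustration

# Keep the names that do not start with an underscore
public_names = [name for name in dir(sample) if not name.startswith("_")]

print(len(public_names))
print(public_names[:5])
```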

## Q1 - Strings

Explore the built-in function `len()` and the string methods `split()` and `strip()` on the following string.

https://docs.python.org/3/library/functions.html

In [11]:
className = " Introduction    to   Data Science   "

In [12]:
# Show how to find the length of the string "className" 
# Store the results in a new variable "class_length"
class_length = len(className)
class_length

37

In [13]:
# Show the results of the `split()` function on the string "className"  
# Store the results in a new variable "class_split"
class_split = className.split()
class_split

['Introduction', 'to', 'Data', 'Science']

In [14]:
# Save the results of the `strip()` function on the string "className" in a 
# new variable "className2"
className2 = className.strip()
className2

'Introduction    to   Data Science'

In [15]:
grader.check("q1")
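
For comparison, here is a short sketch of the same three operations on a different, made-up string (`sample` is not part of the assignment). It also shows one way to collapse repeated spaces, which `strip()` alone does not do.

```python
sample = "  data   science  "  # hypothetical string with extra spaces

length = len(sample)                   # counts every character, spaces included
words = sample.split()                 # splits on runs of whitespace
pieces = sample.split(" ")             # splitting on a single space keeps empty strings
trimmed = sample.strip()               # removes leading and trailing whitespace only
collapsed = " ".join(sample.split())   # single spaces between words

print(length, words, repr(trimmed), repr(collapsed))
```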

## Example 3 - Comments 

To create a comment line (in line with the code), use the # (hash) symbol followed by a space. (Shortcut: `Ctrl + /`.) [To uncomment, remove the # or press `Ctrl + /` again.]

Another option is to enclose the text in triple quotes (`"""` or `'''`). Strictly speaking, this creates a string literal that is not assigned to anything rather than a true comment, and it needs to be on its own line(s), separate from the code. Different programming languages have different approaches to commenting, so please be aware.

In [16]:
# This is a comment

In [17]:
'''This is a larger comment block 
that may span multiple lines 
'''
2 + 2

4
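
A triple-quoted string is an ordinary string expression, which is why it is most commonly used as a docstring. The sketch below uses a hypothetical function `add` to show the difference: the docstring is stored on the function, while the `#` comment is ignored entirely.

```python
def add(a, b):
    """Return the sum of a and b."""
    # This is a true comment; Python discards it
    return a + b


result = add(2, 2)
doc = add.__doc__
print(result, doc)
```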

<!-- BEGIN QUESTION -->

## Q2 - Markdown

The Markdown option for cells in a Jupyter notebook provides a way to display information to the user around particular code snippets. For more information and reading, please look into:

https://help.github.com/articles/basic-writing-and-formatting-syntax/

https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html

Colab's Markdown Guide: https://colab.research.google.com/notebooks/markdown_guide.ipynb#scrollTo=5Y3CStVkLxqt

For this exercise add a new 'Text' cell and try to recreate the following block of text below the two lines. 
<hr><hr>

![example markdown](images/markdown-example.png)




We can start with a few different paragraphs of text. The first paragraph will have a few sentences with various markups found. 
Things like **bold**, *italics*, ~~strikethrough~~, and even `monospace`.

Here is another paragraph of text that contains a url [https://mtu.edu](https://mtu.edu).

we can have lists:

* one
* two
* three

And more lists:

1. one
2. two
3. three

Nested lists:

* one
  * one A
  * one B
* two
* three


<!-- END QUESTION -->

## Example 4 - String Operations 

Here you can see some more operations working with strings.

https://docs.python.org/3/library/stdtypes.html#str

In [18]:
# Named `text` rather than `str` so the built-in `str` type is not shadowed
text = "Hello Data Science 2024"

In [19]:
print(text.find("2024"))

19


In [20]:
print(text[-4:])

2024


In [21]:
text.upper()

'HELLO DATA SCIENCE 2024'

In [22]:
text.lower()

'hello data science 2024'

In [23]:
text + ' & ' + 'FutureDataScientist'

'Hello Data Science 2024 & FutureDataScientist'
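
A few related behaviors are worth knowing, sketched below. The variable name `text` is an assumption made here so that the built-in `str` type stays available; `find()` returns `-1` when the substring is absent, and a slice of digits can be converted with `int()`.

```python
text = "Hello Data Science 2024"

found_at = text.find("Science")   # index of the first match
not_found = text.find("Python")   # -1 signals "no match"
year = int(text[-4:])             # the last four characters, converted to an integer
starts_ok = text.startswith("Hello")

print(found_at, not_found, year, starts_ok)
```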

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

# Pandas Review 

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will review commonly used data-wrangling operations/tools in `pandas`. We aim to give you familiarity with:

* Creating `DataFrames`,
* Slicing `DataFrames` (i.e., selecting rows and columns)
* Filtering data (using boolean arrays)

In this lab, you are going to use several `pandas` methods. Recall from lecture that you can press `shift+tab` on a method's parameters to see the documentation for that method. For example, if you were using the `drop` method in `pandas`, you could press `shift+tab` to see what `drop` is expecting.

**Note**: The `pandas` interface is notoriously confusing for beginners, and the documentation is not consistently great. Throughout the semester, you will have to search through [`pandas` documentation](https://pandas.pydata.org/docs/reference/index.html) and experiment, but remember it is part of the learning experience and will help shape you as a data scientist!




---


## **REVIEW:** Creating `DataFrames` & Basic Manipulations

Recall that a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe) is a table in which each column has a specific data type; there is an index over the columns (typically string labels) and an index over the rows (typically ordinal numbers).

Usually, you'll create `DataFrames` by using a function like `pd.read_csv`. However, in this section, we'll discuss how to create them from scratch.

The [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for the `pandas` `DataFrame` class describes several ways to construct a `DataFrame`.

**Syntax 1:** You can create a `DataFrame` by specifying the columns and values using a dictionary, as shown below. 

The keys of the dictionary are the column names, and the values of the dictionary are lists containing the row entries.

In [24]:
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
          'color': ['red', 'orange', 'yellow', 'pink'],
          'price': [1.0, 0.75, 0.35, 0.05]
          })
fruit_info

Unnamed: 0,fruit,color,price
0,apple,red,1.0
1,orange,orange,0.75
2,banana,yellow,0.35
3,raspberry,pink,0.05


**Syntax 2:** You can also define a `DataFrame` by specifying the rows as shown below. 

Each row corresponds to a distinct tuple, and the columns are specified separately.

In [25]:
fruit_info2 = pd.DataFrame(
    [("red", "apple", 1.0), ("orange", "orange", 0.75), ("yellow", "banana", 0.35),
     ("pink", "raspberry", 0.05)], 
    columns = ["color", "fruit", "price"])
fruit_info2

Unnamed: 0,color,fruit,price
0,red,apple,1.0
1,orange,orange,0.75
2,yellow,banana,0.35
3,pink,raspberry,0.05


You can also convert the entire `DataFrame` into a two-dimensional `NumPy` array. Remember that a `NumPy` array holds homogeneous data (a single data type), whereas a `DataFrame` can contain heterogeneous data.

In [26]:
numbers = pd.DataFrame({"A":[1, 2, 3], "B":[0, 1, 1]})
numpy_numbers = numbers.to_numpy()

print(type(numpy_numbers))
print(numpy_numbers)

<class 'numpy.ndarray'>
[[1 0]
 [2 1]
 [3 1]]


The `values` attribute returns the content of the `DataFrame` as a two-dimensional `NumPy` array. Because the columns of `fruit_info` have mixed types, the array has dtype `object`.

In [27]:
fruit_info.values

array([['apple', 'red', 1.0],
       ['orange', 'orange', 0.75],
       ['banana', 'yellow', 0.35],
       ['raspberry', 'pink', 0.05]], dtype=object)
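
The effect of mixed column types on the resulting array can be checked directly. This is a small sketch with two made-up frames, `numeric` and `mixed`:

```python
import pandas as pd

numeric = pd.DataFrame({"A": [1, 2, 3], "B": [0, 1, 1]})
mixed = pd.DataFrame({"fruit": ["apple", "banana"], "price": [1.0, 0.35]})

numeric_array = numeric.to_numpy()  # all-integer columns give an integer array
mixed_array = mixed.to_numpy()      # strings and floats fall back to dtype object

print(numeric_array.dtype, mixed_array.dtype)
```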

## **REVIEW:** Selecting Rows and Columns in `pandas`

As you've seen in lecture, there are two verbose operators in `pandas` for selecting rows: `loc` and `iloc`. Let's review them briefly.

### **Approach 1:** `loc`

The first of the two verbose operators is `loc`, which takes two arguments. The first is one or more **row labels**, the second is one or more **column labels** - both of which are displayed in bold to the left of each of the rows and above each of the columns, respectively. These are not the same as positional indices, which are used for indexing Python lists or `NumPy` arrays!

The desired rows or columns can be provided individually, in slice notation, or as a list. Some examples are given below.

Note that **slicing in `loc` is inclusive** on the provided labels.

In [28]:
# Get rows 0 through 2 (inclusive) with labels 'fruit' through 'price' 
#  (which would include the color column that is in between both labels)
fruit_info.loc[0:2, 'fruit':'price']

Unnamed: 0,fruit,color,price
0,apple,red,1.0
1,orange,orange,0.75
2,banana,yellow,0.35


In [29]:
# Get rows 0 through 2 (inclusive) and columns 'fruit' and 'price'. 
# Note the difference in notation and result from the previous example.
fruit_info.loc[0:2, ['fruit', 'price']]

Unnamed: 0,fruit,price
0,apple,1.0
1,orange,0.75
2,banana,0.35


In [30]:
# Get rows 0 and 2 and columns fruit and price. 
fruit_info.loc[[0, 2], ['fruit', 'price']]

Unnamed: 0,fruit,price
0,apple,1.0
2,banana,0.35


In [31]:
# Get rows 0 and 2 and column fruit
fruit_info.loc[[0, 2], ['fruit']]

Unnamed: 0,fruit
0,apple
2,banana


Note that if we request a single column but don't enclose it in a list, the return type of the `loc` operator is a `Series` rather than a `DataFrame`. 

In [32]:
# Get rows 0 and 2 and column fruit, returning the result as a Series
fruit_info.loc[[0, 2], 'fruit']

0     apple
2    banana
Name: fruit, dtype: object

If we provide only one argument to `loc`, it uses the provided argument to select rows, and returns all columns.

In [33]:
fruit_info.loc[0:1]

Unnamed: 0,fruit,color,price
0,apple,red,1.0
1,orange,orange,0.75


Note that if you try to access columns without providing rows, `loc` will raise a `KeyError`, because it interprets a single argument as row labels.

In [34]:
# Uncommenting the next line raises a KeyError: loc treats the list as row labels
#fruit_info.loc[["fruit", "price"]]

# This works: select all rows with ":" and then the desired columns
fruit_info.loc[:, ["fruit", "price"]]

Unnamed: 0,fruit,price
0,apple,1.0
1,orange,0.75
2,banana,0.35
3,raspberry,0.05
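
The label-based nature of `loc` is easier to see when the row labels are not integers. The following sketch assumes a variant of the fruit table that uses the fruit names as the index (this indexed frame, `fruit_by_name`, is not part of the lab):

```python
import pandas as pd

fruit_by_name = pd.DataFrame(
    {"color": ["red", "orange", "yellow", "pink"],
     "price": [1.0, 0.75, 0.35, 0.05]},
    index=["apple", "orange", "banana", "raspberry"],
)

# Label slices include both endpoints
middle = fruit_by_name.loc["orange":"banana", "price"]

# A single row label and a single column label return one value
apple_color = fruit_by_name.loc["apple", "color"]

print(middle)
print(apple_color)
```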


<br/>

---

### **Approach 2:** `iloc`

`iloc` is very similar to `loc` except that its arguments are **row numbers** and **column numbers**, rather than row and column labels. A useful mnemonic is that the `i` stands for "integer". This is quite similar to indexing into a Python `list` or `NumPy` array.

In addition, **slicing for `iloc` is exclusive** of the end index, just like slicing a Python `list`. Some examples are given below:

In [35]:
# Get rows 0 through 3 (exclusive) and columns 0 through 3 (exclusive)
fruit_info.iloc[0:3, 0:3]

Unnamed: 0,fruit,color,price
0,apple,red,1.0
1,orange,orange,0.75
2,banana,yellow,0.35


In [36]:
# Get rows 0 through 3 (exclusive) and columns 0 and 2.
fruit_info.iloc[0:3, [0, 2]]

Unnamed: 0,fruit,price
0,apple,1.0
1,orange,0.75
2,banana,0.35


In [37]:
# Get rows 0 and 2 and columns 0 and 2.
fruit_info.iloc[[0, 2], [0, 2]]

Unnamed: 0,fruit,price
0,apple,1.0
2,banana,0.35


In [38]:
# Get rows 0 and 2 and column 0 (fruit), returning the result as a DataFrame
fruit_info.iloc[[0, 2], [0]]

Unnamed: 0,fruit
0,apple
2,banana


In [39]:
# Get rows 0 and 2 and column 0 (fruit), returning the result as a Series
fruit_info.iloc[[0, 2], 0]

0     apple
2    banana
Name: fruit, dtype: object

Note that in these `loc` and `iloc` examples above, the row **label** and row **number** were always the same.

Let's see an example where they are different. If we sort our fruits by price, we get:

In [40]:
fruit_info_sorted = fruit_info.sort_values("price")
fruit_info_sorted

Unnamed: 0,fruit,color,price
3,raspberry,pink,0.05
2,banana,yellow,0.35
1,orange,orange,0.75
0,apple,red,1.0


After sorting, note how row number 0 now has index label 3, row number 1 now has index label 2, etc. These indices are the arbitrary numerical indices generated when we created the `DataFrame`. For example, `banana` was originally in row 2, and so it has row label 2. Note the distinction between the index _label_, and the actual index _position_.

If we request the rows in positions 0 and 2 using `iloc`, we're indexing using the row NUMBERS, not labels. 

In [41]:
fruit_info_sorted.iloc[[0, 2], 0]

3    raspberry
1       orange
Name: fruit, dtype: object

Lastly, similar to `loc`, the second argument to `iloc` is optional. That is, if you provide only one argument to `iloc`, it treats the argument you provide as a set of desired row numbers, not column numbers.

In [42]:
fruit_info_sorted.iloc[[0, 2]]

Unnamed: 0,fruit,color,price
3,raspberry,pink,0.05
1,orange,orange,0.75
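
To make the label/position distinction concrete, the sketch below rebuilds the fruit table, sorts it, and compares `loc` with `iloc`. The final `reset_index(drop=True)` step is an addition not covered above; it is one common way to make labels and positions agree again.

```python
import pandas as pd

fruit_info = pd.DataFrame({
    "fruit": ["apple", "orange", "banana", "raspberry"],
    "color": ["red", "orange", "yellow", "pink"],
    "price": [1.0, 0.75, 0.35, 0.05],
})
fruit_sorted = fruit_info.sort_values("price")

by_position = fruit_sorted.iloc[0]["fruit"]   # first row after sorting
by_label = fruit_sorted.loc[0, "fruit"]       # row whose label is 0

# Renumber the rows so that labels match positions again
renumbered = fruit_sorted.reset_index(drop=True)
after_reset = renumbered.loc[0, "fruit"]

print(by_position, by_label, after_reset)
```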


<br>

---

### **Approach 3:** `[]` Notation

`pandas` also supports the `[]` operator. It's similar to `loc` in that it lets you access rows and columns.

However, unlike `loc`, which takes row labels and optionally column labels, `[]` is more flexible. If you provide it with a slice, it'll give you rows (note that an integer slice such as `0:2` is positional and excludes the endpoint, unlike a `loc` slice), and if you provide it with only column names, it'll give you columns (whereas `loc` will raise a `KeyError`).

Some examples:

In [43]:
fruit_info[0:2]

Unnamed: 0,fruit,color,price
0,apple,red,1.0
1,orange,orange,0.75


In [44]:
# Here we're providing a list of column names as a single argument to []
fruit_info[["fruit", "color", "price"]]

Unnamed: 0,fruit,color,price
0,apple,red,1.0
1,orange,orange,0.75
2,banana,yellow,0.35
3,raspberry,pink,0.05


Note that slicing notation is not supported for columns if you use `[]` notation. Use `loc` instead.

In [45]:
# Uncommenting the next line raises an error: [] does not slice columns by label
#fruit_info["fruit":"price"]

# This works: use loc to slice columns by label
fruit_info.loc[:, "fruit":"price"]

Unnamed: 0,fruit,color,price
0,apple,red,1.0
1,orange,orange,0.75
2,banana,yellow,0.35
3,raspberry,pink,0.05


`[]` and `loc` are quite similar. For example, the following two pieces of code are functionally equivalent for selecting the fruit and price columns.

1. `fruit_info[["fruit", "price"]]` 
2. `fruit_info.loc[:, ["fruit", "price"]]`.

Because it yields more concise code, you'll find that our code and your code both tend to feature `[]`. However, there are some subtle pitfalls of using `[]`. If you're ever having performance issues, weird behavior, or you see a `SettingWithCopyWarning` in `pandas`, switch from `[]` to `loc`, and this may help.

To avoid getting too bogged down in indexing syntax, we'll avoid a more thorough discussion of `[]` and `loc`. We may return to this at a later point in the course.

For more on `[]` vs. `loc`, you may optionally try reading:
1. https://stackoverflow.com/questions/48409128/what-is-the-difference-between-using-loc-and-using-just-square-brackets-to-filte
2. https://stackoverflow.com/questions/38886080/python-pandas-series-why-use-loc/65875826#65875826
3. https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas/53954986#53954986
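
One place where the difference matters is assignment. A hedged sketch: selecting rows with a boolean mask and assigning through a single `loc` call updates the original `DataFrame` directly, which is the pattern usually recommended when a `SettingWithCopyWarning` appears. The price floor of 0.5 is a made-up rule for illustration.

```python
import pandas as pd

fruit_info = pd.DataFrame({
    "fruit": ["apple", "orange", "banana", "raspberry"],
    "price": [1.0, 0.75, 0.35, 0.05],
})

# Boolean mask: True for rows whose price is below the floor
is_cheap = fruit_info["price"] < 0.5

# Row selection and column assignment happen in one loc call
fruit_info.loc[is_cheap, "price"] = 0.5

prices = fruit_info["price"].tolist()
print(prices)
```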

## Q3 - Pandas 

Pandas Resources:
* https://pandas.pydata.org/
* https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

We are going to be using the Abalone data set. It is part of the UCI Machine Learning Repository, a common place to find data sets for testing code and for learning about machine learning and data science.

I have already downloaded the data from https://archive.ics.uci.edu/dataset/1/abalone
 
In the next cell, you will modify the code to read in the `abalone.data` file properly.  Use the following names for the columns:  
`sex`, `len`, `diam`, `hgt`, `wh_wgt`, `shuck_wgt`, `vis_wgt`, `sh_wgt`, `rings`

*HINT:* You will need to look at using additional parameters for the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function. It will be helpful to look at the documentation on `read_csv`   
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html



In [46]:
colNames = ['sex','len','diam','hgt','wh_wgt','shuck_wgt','vis_wgt','sh_wgt','rings']
# The file has no header row, so pass header=None and supply the column names
# through the names argument of read_csv
df = pd.read_csv("./data/abalone.data", sep=',', header=None, names=colNames)

df.head()

Unnamed: 0,sex,len,diam,hgt,wh_wgt,shuck_wgt,vis_wgt,sh_wgt,rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [47]:
grader.check("q3")
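
To see why `header=None` matters, here is a minimal sketch on a tiny in-memory file. The two data rows and the shortened column list are made up; they only mimic the layout of a headerless file.

```python
import io

import pandas as pd

raw = "M,0.455,15\nF,0.53,9\n"      # hypothetical headerless data
cols = ["sex", "len", "rings"]      # shortened column list for the sketch

# Without header=None, the first data row is consumed as the header
wrong = pd.read_csv(io.StringIO(raw))

# With header=None and names=..., every row is kept as data
right = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(wrong.shape, right.shape)
print(list(right.columns))
```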

## Q4 - Pandas 

Here you will explore properties of the DataFrame and its attributes.

In [48]:
# Determine the number of rows and columns of the data set 
rows = len(df) #finds number of rows in dataframe
columns = len(df.columns) #finds length of columns for the dataframe

# Determine what are the column names 
dfColumnNames = df.columns

print(f'No of rows: {rows}')
print(f'No of columns: {columns}')
dfColumnNames

No of rows: 4177
No of columns: 9


Index(['sex', 'len', 'diam', 'hgt', 'wh_wgt', 'shuck_wgt', 'vis_wgt', 'sh_wgt',
       'rings'],
      dtype='object')

In [49]:
grader.check("q4")
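
An alternative sketch: the `shape` attribute returns the row and column counts together as a tuple, and `dtypes` reports the type of each column. The frame `demo` below is a stand-in, not the abalone data.

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = demo.shape        # (rows, columns)
column_names = list(demo.columns)  # column labels as a plain list

print(n_rows, n_cols, column_names)
print(demo.dtypes)
```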

## Q5 - Pandas 

Show the first 4 rows of the DataFrame.

Show the last 7 rows of the DataFrame.

In [50]:
first_4_rows = df.head(4)
last_7_rows = df.tail(7)
print(first_4_rows)
print(last_7_rows)

  sex    len   diam    hgt  wh_wgt  shuck_wgt  vis_wgt  sh_wgt  rings
0   M  0.455  0.365  0.095  0.5140     0.2245   0.1010   0.150     15
1   M  0.350  0.265  0.090  0.2255     0.0995   0.0485   0.070      7
2   F  0.530  0.420  0.135  0.6770     0.2565   0.1415   0.210      9
3   M  0.440  0.365  0.125  0.5160     0.2155   0.1140   0.155     10
     sex    len   diam    hgt  wh_wgt  shuck_wgt  vis_wgt  sh_wgt  rings
4170   M  0.550  0.430  0.130  0.8395     0.3155   0.1955  0.2405     10
4171   M  0.560  0.430  0.155  0.8675     0.4000   0.1720  0.2290      8
4172   F  0.565  0.450  0.165  0.8870     0.3700   0.2390  0.2490     11
4173   M  0.590  0.440  0.135  0.9660     0.4390   0.2145  0.2605     10
4174   M  0.600  0.475  0.205  1.1760     0.5255   0.2875  0.3080      9
4175   F  0.625  0.485  0.150  1.0945     0.5310   0.2610  0.2960     10
4176   M  0.710  0.555  0.195  1.9485     0.9455   0.3765  0.4950     12


In [51]:
grader.check("q5")

## Q6 - Pandas 

Practice selecting different parts of the DataFrame

Select the `sh_wgt` column

Then, select both the `diam` and `vis_wgt` columns and only the first eight rows.

In [52]:
# select just the sh_wgt column 
shell_wgt = df['sh_wgt']
shell_wgt

0       0.1500
1       0.0700
2       0.2100
3       0.1550
4       0.0550
         ...  
4172    0.2490
4173    0.2605
4174    0.3080
4175    0.2960
4176    0.4950
Name: sh_wgt, Length: 4177, dtype: float64

In [53]:
diam_vis_wgt = df[['diam','vis_wgt']]
diam_vis_wgt

Unnamed: 0,diam,vis_wgt
0,0.365,0.1010
1,0.265,0.0485
2,0.420,0.1415
3,0.365,0.1140
4,0.255,0.0395
...,...,...
4172,0.450,0.2390
4173,0.440,0.2145
4174,0.475,0.2875
4175,0.485,0.2610


In [54]:
grader.check("q6")
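
Selecting columns and limiting the rows can be combined in one expression. The sketch below uses a made-up ten-row frame, `demo`, and shows two equivalent ways to keep two columns and only the first eight rows; note that the `loc` slice `0:7` includes label 7.

```python
import pandas as pd

demo = pd.DataFrame({
    "diam": [0.1 * i for i in range(10)],
    "vis_wgt": [0.01 * i for i in range(10)],
    "rings": list(range(10)),
})

# Option 1: pick the columns, then keep the first eight rows
first_eight = demo[["diam", "vis_wgt"]].head(8)

# Option 2: a single loc call with an inclusive label slice
first_eight_loc = demo.loc[0:7, ["diam", "vis_wgt"]]

print(first_eight.shape)
```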

## Q7 - Pandas 

Select the following using the `.iloc` function: 
* `index_6` - row with index=5, the 6th row, of the DataFrame 
* `row_5_7` - the 5th and 7th row of the DataFrame, and 
* `ansC` - every other row and every third column starting from the 2nd row and 3rd column



In [55]:
df.head(6)

Unnamed: 0,sex,len,diam,hgt,wh_wgt,shuck_wgt,vis_wgt,sh_wgt,rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
5,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8


In [56]:
index_6 = df.iloc[5]
index_6

sex               I
len           0.425
diam            0.3
hgt           0.095
wh_wgt       0.3515
shuck_wgt     0.141
vis_wgt      0.0775
sh_wgt         0.12
rings             8
Name: 5, dtype: object

In [57]:
row_5_7 = df.iloc[[4, 6]]  # positions 4 and 6 are the 5th and 7th rows
row_5_7

Unnamed: 0,sex,len,diam,hgt,wh_wgt,shuck_wgt,vis_wgt,sh_wgt,rings
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
6,F,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,20


In [58]:
# Start at position 2 and take every 2nd row and every 3rd column (step slicing).
# Adapted from https://stackoverflow.com/questions/25055712/pandas-every-nth-row;
# a boolean mask (F, F, T, F, F, T, ...) would also work for the columns.
ansC = df.iloc[2::2, 2::3]
ansC

Unnamed: 0,diam,shuck_wgt,rings
2,0.420,0.2565,9
4,0.255,0.0895,7
6,0.415,0.2370,20
8,0.370,0.2165,9
10,0.380,0.1940,14
...,...,...,...
4168,0.400,0.2865,8
4170,0.430,0.3155,10
4172,0.450,0.3700,11
4174,0.475,0.5255,9


In [59]:
grader.check("q7")
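
Because `iloc` positions start at 0, the "nth" row sits at position n - 1. The sketch below illustrates `start::step` slicing on a made-up 6 x 6 grid, starting from the 2nd row (position 1) and the 3rd column (position 2).

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(36).reshape(6, 6), columns=list("abcdef"))

# Every other row from position 1, every third column from position 2
stepped = grid.iloc[1::2, 2::3]

print(stepped)
```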

## Q8 - Data Selection and Statistics 

Compute the `mean()`, `max()`, and `min()` of the first 12 data points for all the weight columns.

*Hint: remember df.head(10) returns the first 10 rows of the DataFrame*

In [60]:
meanVals =  df.head(12).iloc[:,4:8].mean()
meanVals

wh_wgt       0.537583
shuck_wgt    0.204167
vis_wgt      0.108750
sh_wgt       0.181667
dtype: float64

In [61]:
maxVals = df.head(12).iloc[:,4:8].max()
maxVals

wh_wgt       0.8945
shuck_wgt    0.3145
vis_wgt      0.1510
sh_wgt       0.3300
dtype: float64

In [62]:
minVals = df.head(12).iloc[:,4:8].min()
minVals

wh_wgt       0.2050
shuck_wgt    0.0895
vis_wgt      0.0395
sh_wgt       0.0550
dtype: float64

In [63]:
grader.check("q8")
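
As an alternative sketch, the weight columns can be picked by name instead of by position, and `agg` can compute several statistics in one call. The frame `demo` and its values are invented; the only assumption carried over from the lab is that weight columns end in `_wgt`.

```python
import pandas as pd

demo = pd.DataFrame({
    "len": [0.4, 0.5, 0.6],
    "wh_wgt": [0.5, 0.7, 0.9],
    "sh_wgt": [0.1, 0.2, 0.3],
})

# Pick columns by naming convention rather than by integer position
wgt_cols = [col for col in demo.columns if col.endswith("_wgt")]

# Three statistics for the first two rows, one row of output per statistic
summary = demo.head(2)[wgt_cols].agg(["mean", "max", "min"])

print(summary)
```

Selecting by name keeps the code correct even if the column order changes.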

## Q9 - Data Selection and Statistics 

Group by column "sex" and find the median for the other variables. 

In [65]:
# Calling median() on the whole grouped DataFrame covers every remaining column,
# so there is no need to select the columns one by one as in the week 2 notes.
group = df.groupby('sex').median()
group

sex,len,diam,hgt,wh_wgt,shuck_wgt,vis_wgt,sh_wgt,rings
F,0.59,0.465,0.16,1.0385,0.4405,0.224,0.295,10.0
I,0.435,0.335,0.11,0.384,0.16975,0.0805,0.113,8.0
M,0.58,0.455,0.155,0.97575,0.42175,0.21,0.276,10.0


In [66]:
grader.check("q9")
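
A small sketch of the same grouping pattern on invented data, also showing how to restrict the result to chosen columns by selecting them after `groupby`:

```python
import pandas as pd

demo = pd.DataFrame({
    "sex": ["M", "F", "M", "F", "I"],
    "rings": [10, 9, 12, 11, 7],
    "len": [0.50, 0.40, 0.60, 0.50, 0.30],
})

# Median of every remaining column, one row per group
medians = demo.groupby("sex").median()

# Median of selected columns only
ring_medians = demo.groupby("sex")[["rings"]].median()

print(medians)
```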

## Q10 - Data Selection and Statistics 

Find the mean weights of abalone with more than 12 rings.

In [78]:
mean_vals = df[df['rings'] > 12].iloc[:,4:8].mean()
mean_vals

wh_wgt       1.119511
shuck_wgt    0.432494
vis_wgt      0.238449
sh_wgt       0.350519
dtype: float64

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [79]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)

Running your submission against local test cases...



Your submission received the following results when run against available test cases:

    q1 results: All test cases passed!

    q3 results: All test cases passed!

    q4 results: All test cases passed!

    q5 results: All test cases passed!

    q6 results: All test cases passed!

    q7 results: All test cases passed!

    q8 results: All test cases passed!

    q9 results: All test cases passed!
