In this assignment, we will examine measurements related to predicting the age of abalone, which are marine snails, using 4177 samples. The measurements include sex, shell length, shell diameter, shell weight, and number of rings, among others. More detail about the data file is available at the end of this assignment; you will need this information to complete the assignment. Note the benefit of data frames: reading the file is simple, and all of this can be done without a single loop.

The data file is available as an <a href= "https://drive.google.com/drive/folders/1368ioOeVzEA5yj3Gyv0mhi9_Q6KUr3ji">[archived copy]</a> or from the original source (abalone.data in the downloads folder of <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/">this link</a>).

Link to the data frames tutorial material: [<a href = "https://docs.google.com/document/d/1phdIyrb1RdF05sUvWFNd1F_OO5W7G7kobB12IhOnmJI/edit">Data frames tutorial</a>]. 

## Perform the following steps: 

### 1. Open the file in a text editor while you read over the column names in the data file details at the end of this assignment.
<br><font color="red">
Note that the extension is '.data', and even though it is a text file, your operating system may not recognize it as one. One option is to change the extension (to something like '.txt'). Another is to start your text editor and open the file through the editor. Either way, doing this will give you a chance to observe the structure of the data while you read the data file details.</font>

### 2. Read the text file in as a data frame, then print it to make sure it was read in correctly. Keep in mind this can be a tricky, error-prone problem depending on the formatting of the data file.

Use the method you used previously to read in a CSV file. There are at least two potential problems to consider. If the tutorial is insufficient, the online documentation for the read function explains how to handle the following issues.

   <ol>
    <li> <b>The delimiter may be a space rather than a comma</b>: Check which one your copy of the file uses while it is open in the text editor. If the values are separated by spaces, one option is to use the Find/Replace All option to change every space to a single comma. However, that’s cumbersome and error-prone. A better option is to use the read CSV function and set the separator to a space rather than a comma. If the values are already separated by commas, the default separator works as is. </li>
    <li><b>There is no header line</b>: You have a description of each column, but the file does not have column headers. You may need to make sure the read function is set up so it does not expect column headers. However, a more common option is to supply your own 9-item list of strings to serve as column headers - you can use the one below as an example. Referring to columns by name rather than by numeric position will make your code much easier to read later and is highly recommended for this step.</li>
       </ol><br>

['sex', 'length', 'diameter', 'height', 'whole weight', 'shucked weight', 'viscera weight', 'shell weight', 'rings']
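As a minimal sketch of the two read options described above, the example below parses the same row from a space-delimited and a comma-delimited in-memory string. The text variables and the row itself are illustrative stand-ins for the real file, and `io.StringIO` is used only so the example runs without `abalone.data` on disk.

```python
import io
import pandas as pd

# Column names taken from the list above.
column_names = ['sex', 'length', 'diameter', 'height', 'whole weight',
                'shucked weight', 'viscera weight', 'shell weight', 'rings']

# Illustrative one-row "files" (assumed values, not read from abalone.data).
space_text = "M 0.455 0.365 0.095 0.514 0.2245 0.101 0.15 15\n"
comma_text = "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"

# sep=' ' handles a space delimiter; passing names= tells pandas
# that the first line is data, not a header.
space_df = pd.read_csv(io.StringIO(space_text), sep=' ', names=column_names)
comma_df = pd.read_csv(io.StringIO(comma_text), names=column_names)

# header=None also avoids treating the first row as a header,
# but leaves the columns labeled 0..8 instead of by name.
unnamed_df = pd.read_csv(io.StringIO(comma_text), header=None)

print(space_df.equals(comma_df))
print(list(unnamed_df.columns))
```

Both named reads produce the same one-row data frame, so the choice of `sep` only has to match whatever delimiter the file on disk actually uses.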

In [3]:
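import pandas as pd

#The file has no header row, so the column names are supplied explicitly
column_names = ['sex', 'length', 'diameter', 'height', 'whole weight',
                'shucked weight', 'viscera weight', 'shell weight', 'rings']
abalone_pd = pd.read_csv('abalone.data', names=column_names)
print(abalone_pd)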

     sex  length  diameter  height  whole weight  shucked weight  \
0      M   0.455     0.365   0.095        0.5140          0.2245   
1      M   0.350     0.265   0.090        0.2255          0.0995   
2      F   0.530     0.420   0.135        0.6770          0.2565   
3      M   0.440     0.365   0.125        0.5160          0.2155   
4      I   0.330     0.255   0.080        0.2050          0.0895   
5      I   0.425     0.300   0.095        0.3515          0.1410   
6      F   0.530     0.415   0.150        0.7775          0.2370   
7      F   0.545     0.425   0.125        0.7680          0.2940   
8      M   0.475     0.370   0.125        0.5095          0.2165   
9      F   0.550     0.440   0.150        0.8945          0.3145   
10     F   0.525     0.380   0.140        0.6065          0.1940   
11     M   0.430     0.350   0.110        0.4060          0.1675   
12     M   0.490     0.380   0.135        0.5415          0.2175   
13     F   0.535     0.405   0.145        0.6845

### 3. Now let’s get a sense of the length of the abalones. First print only the lengths of the abalones by selecting only that column. Then calculate the mean and standard deviation of the lengths.

There are two ways you can calculate the mean. The easiest is to select the length column and use the mean function that can instantly be performed on columns in a data frame. The second option is to convert the column of data into an array, then perform the operation as you would for an array. Do this for the mean and the standard deviation.

(Optional) you could create a histogram with this data too. For that you may need to convert the data to a list or array after selecting that column.
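The two approaches described above can be sketched on a handful of values. The five lengths below are the first five rows of the printed data frame; the variable names are illustrative. One detail worth knowing is that the pandas `std` method divides by n-1 by default, while `np.std` divides by n, so the two only agree when `ddof` is set to match. The last line uses `np.histogram` to compute bin counts only; drawing the histogram with a plotting library is left out of this sketch.

```python
import numpy as np
import pandas as pd

# First five lengths from the printed data frame above.
lengths = pd.Series([0.455, 0.350, 0.530, 0.440, 0.330], name="length")

# Option 1: call the method directly on the column.
mean_from_series = lengths.mean()
sample_sd = lengths.std()              # pandas default: ddof=1 (sample)

# Option 2: convert the column to an array and use NumPy.
length_array = lengths.to_numpy()
mean_from_array = np.mean(length_array)
population_sd = np.std(length_array)   # NumPy default: ddof=0 (population)
matching_sd = np.std(length_array, ddof=1)  # same formula as pandas

# Bin counts for a histogram with 3 equal-width bins.
counts, bin_edges = np.histogram(length_array, bins=3)

print(mean_from_series, mean_from_array)
print(sample_sd, population_sd, matching_sd)
print(counts, bin_edges)
```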

In [8]:
import numpy as np

#Selecting the length of the abalones
print(abalone_pd["length"])

#Mean of the lengths
length_mean = abalone_pd["length"].mean()
print()
print('Mean of the lengths: ' + str(length_mean))

#Standard deviation of the lengths
#(np.std uses ddof=0, the population formula; abalone_pd["length"].std() would use ddof=1)
length_sd = np.std(abalone_pd["length"])
print()
print('Standard deviation of the lengths: ' + str(length_sd))

0       0.455
1       0.350
2       0.530
3       0.440
4       0.330
5       0.425
6       0.530
7       0.545
8       0.475
9       0.550
10      0.525
11      0.430
12      0.490
13      0.535
14      0.470
15      0.500
16      0.355
17      0.440
18      0.365
19      0.450
20      0.355
21      0.380
22      0.565
23      0.550
24      0.615
25      0.560
26      0.580
27      0.590
28      0.605
29      0.575
        ...  
4147    0.695
4148    0.770
4149    0.280
4150    0.330
4151    0.350
4152    0.370
4153    0.430
4154    0.435
4155    0.440
4156    0.475
4157    0.475
4158    0.480
4159    0.560
4160    0.585
4161    0.585
4162    0.385
4163    0.390
4164    0.390
4165    0.405
4166    0.475
4167    0.500
4168    0.515
4169    0.520
4170    0.550
4171    0.560
4172    0.565
4173    0.590
4174    0.600
4175    0.625
4176    0.710
Name: length, Length: 4177, dtype: float64

Mean of the lengths: 0.5239920995930099

Standard deviation of the lengths: 0.1200785362060288


### 4. Create a data frame called low_rings_df with all of the abalones that have fewer than 10 rings. Similarly, create a data frame called high_rings_df with the abalones that have 10 or more rings. Now calculate the mean length, diameter, height, and whole weight for both groups and note the trend as abalones acquire more rings.

In [12]:
low_rings_df = abalone_pd[abalone_pd["rings"]<10]
low_length_mean = low_rings_df["length"].mean()
low_diameter_mean = low_rings_df["diameter"].mean()
low_height_mean = low_rings_df["height"].mean()
low_wweight_mean = low_rings_df["whole weight"].mean()

print('Mean length of the low-ringed abalones: ' + str(low_length_mean))
print('Mean diameter of the low-ringed abalones: ' + str(low_diameter_mean))
print('Mean height of the low-ringed abalones: ' + str(low_height_mean))
print('Mean whole weight of the low-ringed abalones: ' + str(low_wweight_mean))

print()
print()

high_rings_df = abalone_pd[abalone_pd["rings"] >= 10]
high_length_mean = high_rings_df["length"].mean()
high_diameter_mean = high_rings_df["diameter"].mean()
high_height_mean = high_rings_df["height"].mean()
high_wweight_mean = high_rings_df["whole weight"].mean()

print('Mean length of the high-ringed abalones: ' + str(high_length_mean))
print('Mean diameter of the high-ringed abalones: ' + str(high_diameter_mean))
print('Mean height of the high-ringed abalones: ' + str(high_height_mean))
print('Mean whole weight of the high-ringed abalones: ' + str(high_wweight_mean))

Mean length of the low-ringed abalones: 0.4623687977099238
Mean diameter of the low-ringed abalones: 0.355443702290077
Mean height of the low-ringed abalones: 0.11847089694656505
Mean whole weight of the low-ringed abalones: 0.5703184637404581


Mean length of the high-ringed abalones: 0.5860595867371469
Mean diameter of the high-ringed abalones: 0.46069678039404094
Mean height of the high-ringed abalones: 0.16071359923113884
Mean whole weight of the high-ringed abalones: 1.0890285920230636


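All four means are larger in the high-ring group, which suggests that length, diameter, height, and whole weight increase with ring count and are therefore candidate features for predicting the age of abalones.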
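The same low-ring versus high-ring comparison can also be sketched more compactly by selecting several columns at once and grouping on the ring condition. This is one possible alternative, not the required solution. The size measurements below are the first five rows of the printed data frame, while the `rings` values and the group labels are illustrative assumptions.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "length": [0.455, 0.350, 0.530, 0.440, 0.330],
    "diameter": [0.365, 0.265, 0.420, 0.365, 0.255],
    "height": [0.095, 0.090, 0.135, 0.125, 0.080],
    "whole weight": [0.5140, 0.2255, 0.6770, 0.5160, 0.2050],
    "rings": [15, 7, 9, 10, 7],  # illustrative ring counts
})

size_columns = ["length", "diameter", "height", "whole weight"]

# Label each row by whether it has 10 or more rings.
ring_group = sample_df["rings"].ge(10).map(
    {False: "low (<10)", True: "high (>=10)"}
)

# One call computes the mean of every selected column for both groups.
group_means = sample_df.groupby(ring_group)[size_columns].mean()
print(group_means)
```

Applied to the full `abalone_pd` data frame, the same `groupby` call would produce one row of means per group without repeating the calculation for each column.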

<b>Data file details (the meaning of each column)</b>
 
Given are the attribute name, attribute type, measurement unit, and a brief description.

| Name | Data Type | Meas. | Description |
|------|-----------|-------|-------------|
| Sex | nominal | | M, F, and I (infant) |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | | +1.5 gives the age in years |
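
The last row of the details above says that the ring count plus 1.5 gives the age in years. A minimal sketch of that conversion as a new column follows; the ring counts and the column name `age_years` are illustrative assumptions.

```python
import pandas as pd

# Illustrative ring counts (not read from abalone.data).
rings_df = pd.DataFrame({"rings": [15, 7, 9, 10, 7]})

# Age in years = number of rings + 1.5, applied to the whole column at once.
rings_df["age_years"] = rings_df["rings"] + 1.5

print(rings_df)
```

The addition is applied to every row at once, so no loop is needed.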