**First Tutorial: [Munging Data](http://wavedatalab.github.io/datawithpython/munge.html#Read-the-csv-file-of-your-choice-using-Pandas)**

*Key Takeaways:*
1. If a cell changes an object and then hits an error, correcting the error is not enough: the object has already been modified, so you need to go back and reinitialize it before the problematic cell will run properly.
2. How to install pandas and numpy from the Terminal using pip (`pip install pandas numpy`)
3. Basic ways of inspecting a dataset to make it more manageable
4. A couple of the commands in the tutorial were outdated, so I learned more about how to Google what I wanted. For example, `order` is now `sort_values` for Series, and `ix` has been replaced by `iloc` (position-based) and `loc` (label-based).
    - In doing so, I also learned what "deprecated" means when it comes to programming
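
Takeaway 1 is easier to see with a small reproduction. This is only a sketch: `make_frame` and the two-row frame are made up for illustration, and the `try`/`except` blocks stand in for running a notebook cell that fails partway through.

```python
import pandas as pd

def make_frame():
    # Hypothetical stand-in for the cell that first created the object.
    return pd.DataFrame({"name": ["A", "B"], "calories": [70, 120]})

df = make_frame()

# A "cell" that changes df and then fails partway through.
try:
    df.drop(columns=["calories"], inplace=True)  # succeeds and modifies df
    df["missing_column"].mean()                  # raises KeyError
except KeyError:
    pass

# Correcting the second line is not enough: re-running the cell now fails
# on the first line, because the column was already dropped.
try:
    df.drop(columns=["calories"], inplace=True)
    rerun_ok = True
except KeyError:
    rerun_ok = False

# Reinitializing the object first lets the corrected cell run.
df = make_frame()
df.drop(columns=["calories"], inplace=True)
columns_after_reset = list(df.columns)
```

Avoiding `inplace=True` and assigning the result to a new name tends to make cells safer to re-run, since the original object is left untouched.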
    
*Mini Investigation: `ix` vs `iloc` vs `loc`*
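
As far as I can tell, `ix` accepted either labels or positions, which made it ambiguous when the index itself was made of integers, and it was removed in pandas 1.0. The sketch below uses a made-up three-row frame whose index labels (10, 20, 30) deliberately differ from the row positions (0, 1, 2), so the two replacements can be told apart.

```python
import pandas as pd

# Made-up data: index labels differ from row positions on purpose.
df = pd.DataFrame(
    {"calories": [70, 120, 50], "fat": [1, 5, 0]},
    index=[10, 20, 30],
)

# loc selects by label, and a label slice includes its end point.
by_label = df.loc[10:20, "calories"]

# iloc selects by integer position, and a position slice excludes its end point.
by_position = df.iloc[0:2, 0]

# Single cells work the same way: label 30 is position 2, "fat" is column 1.
fat_by_label = df.loc[30, "fat"]
fat_by_position = df.iloc[2, 1]
```

Both slices return the same two rows here, but for different reasons: `loc[10:20]` stops at the label 20, while `iloc[0:2]` stops before position 2.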

In [1]:
print("Hello, World")

Hello, World


In [3]:
import numpy as np
import pandas as pd

In [24]:
cereal = pd.read_csv("cereal.csv")

In [25]:
cereal.head()

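| | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| 2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| 4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |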


In [26]:
cereal.shape # gives #rows, #cols

(77, 16)

In [27]:
len(cereal) # number of rows, basically number of samples

77

In [28]:
cereal.columns # returns column names

Index(['name', 'mfr', 'type', 'calories', 'protein', 'fat', 'sodium', 'fiber',
       'carbo', 'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups',
       'rating'],
      dtype='object')

In [29]:
cereal['name'][:5] # get first five rows of a column by name

0                    100% Bran
1            100% Natural Bran
2                     All-Bran
3    All-Bran with Extra Fiber
4               Almond Delight
Name: name, dtype: object

In [30]:
cereal[:5] # same as cereal.head()

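| | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| 2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| 4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |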


In [35]:
ratingranges = pd.cut(cereal['rating'], 10) # divides rating col into 10 equal-width bins
ratingranges.head()

0     (63.44, 71.006]
1    (33.175, 40.741]
2     (55.874, 63.44]
3    (86.139, 93.705]
4    (33.175, 40.741]
Name: rating, dtype: category
Categories (10, interval[float64]): [(17.967, 25.609] < (25.609, 33.175] < (33.175, 40.741] < (40.741, 48.308] ... (63.44, 71.006] < (71.006, 78.572] < (78.572, 86.139] < (86.139, 93.705]]

In [36]:
ratingranges.value_counts() # counts per bin, most frequent first

(33.175, 40.741]    22
(25.609, 33.175]    14
(48.308, 55.874]    12
(40.741, 48.308]    11
(55.874, 63.44]      6
(17.967, 25.609]     6
(63.44, 71.006]      3
(71.006, 78.572]     2
(86.139, 93.705]     1
(78.572, 86.139]     0
Name: rating, dtype: int64
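
The counts above come back ordered by frequency, which makes the distribution harder to read, and one bin shows a count of 0. A minimal sketch of both points, using five made-up ratings rather than the cereal data:

```python
import pandas as pd

# Made-up ratings; nothing falls in the middle third of the range.
ratings = pd.Series([18.0, 30.0, 34.0, 35.0, 93.0], name="rating")

# pd.cut with an integer splits the min-to-max range into equal-width bins.
bins = pd.cut(ratings, 3)

# value_counts() sorts by frequency; sort_index() restores bin order.
# Empty bins are kept with a count of 0 because the result is categorical.
counts_in_bin_order = bins.value_counts().sort_index()
n_bins = len(bins.cat.categories)
```

Equal-width bins can end up very uneven in size, as they do here. `pd.qcut` is the pandas function for bins with roughly equal counts, although I did not try it in this tutorial.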

In [40]:
cereal.iloc[0,0:6] # gives first six columns of first (zero-th) row

name        100% Bran
mfr                 N
type                C
calories           70
protein             4
fat                 1
Name: 0, dtype: object

In [52]:
cereal['calories'].sort_values()[:5] # sorts the calories column in ascending order and shows the five lowest values

3     50
54    50
55    50
0     70
2     70
Name: calories, dtype: int64
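
Sorting a single column returns only that column, so the cereal names are lost. A sketch of sorting whole rows instead, using four rows copied from the `head()` output above as stand-in data:

```python
import pandas as pd

# Four rows copied from the head() output, used here as stand-in data.
df = pd.DataFrame({
    "name": ["100% Bran", "100% Natural Bran", "All-Bran",
             "All-Bran with Extra Fiber"],
    "calories": [70, 120, 70, 50],
})

# DataFrame.sort_values takes the column to sort by and keeps rows intact.
lowest = df.sort_values(by="calories").head(2)

# ascending=False reverses the order to show the highest values first.
highest = df.sort_values(by="calories", ascending=False).head(1)
```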

In [56]:
cereal.dtypes # gives datatypes of each column

name         object
mfr          object
type         object
calories      int64
protein       int64
fat           int64
sodium        int64
fiber       float64
carbo       float64
sugars        int64
potass        int64
vitamins      int64
shelf         int64
weight      float64
cups        float64
rating      float64
dtype: object
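
Once the dtypes are known, they can also be used to pick columns. A small sketch with a made-up two-row frame; `select_dtypes` is a regular DataFrame method, but the frame and variable names are only for illustration:

```python
import pandas as pd

# Made-up frame with one text, one integer, and one float column.
df = pd.DataFrame({
    "name": ["100% Bran", "All-Bran"],
    "calories": [70, 70],
    "fiber": [10.0, 9.0],
})

# Keep only the numeric columns (ints and floats).
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric, which here is just the text column.
text_cols = df.select_dtypes(exclude="number").columns.tolist()
```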

In [60]:
cereal['mfr'].unique() # returns unique values for column by name

array(['N', 'Q', 'K', 'R', 'G', 'P', 'A'], dtype=object)

In [62]:
len(cereal['mfr'].unique()) # returns number of unique values for a column

7
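
`len(...unique())` works, but pandas also has a direct shortcut. A sketch on a made-up Series of manufacturer codes:

```python
import pandas as pd

# Made-up manufacturer codes with some repeats.
mfr = pd.Series(["N", "Q", "K", "K", "R", "G", "G"])

# nunique() returns the number of distinct values directly.
n_unique = mfr.nunique()

# value_counts() goes one step further and counts rows per distinct value.
per_mfr = mfr.value_counts()
```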

In [66]:
cereal.loc[0:3,'calories'] # select a column by name; loc slices by label and includes the end label, so 0:3 gives four rows

0     70
1    120
2     70
3     50
Name: calories, dtype: int64

In [68]:
cereal.loc[0:3,'type'] == "C" # obtain boolean values: True if type is C (cold)

0    True
1    True
2    True
3    True
Name: type, dtype: bool
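
A boolean Series like this one is mostly useful as a filter. A sketch on made-up data; the `"H"` code and the "Hypothetical Hot Oats" row are assumptions added only so that the mask has something to exclude:

```python
import pandas as pd

# Made-up frame; the "H" row is hypothetical and exists only for contrast.
df = pd.DataFrame({
    "name": ["100% Bran", "All-Bran", "Hypothetical Hot Oats"],
    "type": ["C", "C", "H"],
})

# Comparing a column to a value gives one True/False per row.
mask = df["type"] == "C"

# Indexing the frame with the mask keeps only the rows marked True.
cold = df[mask]

# True counts as 1, so summing the mask counts the matching rows.
n_cold = int(mask.sum())
```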