# Looking at DataFrame Data

1. Run the cell below to import the required libraries and create a DataFrame.

In [1]:
import pandas as pd
import numpy as np
import random

num_rows = 100
colors = ['Red', 'Blue', 'Green']

df = pd.DataFrame( {'color': [colors[random.randint(0,2)] for _ in range(num_rows)],
                    'integers': [random.randint(0,15) for _ in range(num_rows)],
                    'floats': [random.random() for _ in range(num_rows)]})
df

Unnamed: 0,color,integers,floats
0,Blue,1,0.066970
1,Blue,10,0.110125
2,Green,8,0.627695
3,Blue,6,0.707196
4,Red,12,0.508152
...,...,...,...
95,Red,9,0.950588
96,Blue,0,0.800393
97,Green,1,0.303414
98,Blue,2,0.612543


2. Use the DataFrame `head()` method to view the top five rows. Try giving it a number as an argument to control how many rows are displayed.

In [2]:
df.head()

Unnamed: 0,color,integers,floats
0,Blue,1,0.06697
1,Blue,10,0.110125
2,Green,8,0.627695
3,Blue,6,0.707196
4,Red,12,0.508152
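
As a small illustration of the row-count argument, the sketch below uses a hand-written stand-in frame (`demo` and its values are made up for this example and are not the random `df` above). It also shows `tail()`, which reads from the bottom of the frame instead.

```python
import pandas as pd

# Hypothetical stand-in data, not the random df from step 1
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green', 'Red'],
                     'integers': [3, 7, 1, 15, 0, 9, 4]})

top_three = demo.head(3)          # first three rows
all_but_last_two = demo.head(-2)  # negative n: every row except the last two
last_two = demo.tail(2)           # last two rows
```

With no argument, both `head()` and `tail()` default to five rows.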


3. View summary statistics using the DataFrame `describe()` method.

In [3]:
df.describe()

Unnamed: 0,integers,floats
count,100.0,100.0
mean,7.06,0.454524
std,4.771168,0.296424
min,0.0,0.013255
25%,2.0,0.155303
50%,7.0,0.425715
75%,11.25,0.691859
max,15.0,0.9973


4. The `describe()` method accepts some optional arguments, including `include` and `exclude`. By default, `describe()` only shows statistics for columns with numerical data, but if you add the argument `include=object`, it will display statistics for columns with string data. Try this.

In [4]:
df.describe(include=object)

Unnamed: 0,color
count,100
unique,3
top,Red
freq,39
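
The `exclude` argument works in the opposite direction from `include`. The following is a minimal sketch on a made-up frame (`demo` and its values are assumptions for illustration), using `np.number` to stand for all numeric dtypes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the random df from step 1
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Red', 'Green', 'Red'],
                     'integers': [4, 8, 15, 16, 23],
                     'floats': [0.1, 0.2, 0.3, 0.4, 0.5]})

# Keep only the numeric columns (the same columns the default picks here)
numeric_stats = demo.describe(include=[np.number])

# Drop the numeric columns, which leaves the string column
text_stats = demo.describe(exclude=[np.number])
```

Here `text_stats` reports `count`, `unique`, `top`, and `freq` for 'color', while `numeric_stats` reports the usual mean, standard deviation, and quartiles.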


5. If you change the argument to `include='all'`, it will display statistics for all columns in the DataFrame, inserting `NaN` (not a number) where a statistic does not apply to the column's data type. Try viewing statistics for all columns using `describe()`.

In [5]:
df.describe(include='all')

Unnamed: 0,color,integers,floats
count,100,100.0,100.0
unique,3,,
top,Red,,
freq,39,,
mean,,7.06,0.454524
std,,4.771168,0.296424
min,,0.0,0.013255
25%,,2.0,0.155303
50%,,7.0,0.425715
75%,,11.25,0.691859
max,,15.0,0.9973
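
To check programmatically which cells were filled with `NaN`, something like the sketch below could be used. The frame `demo` is a made-up example, and `pd.isna` is the standard pandas test for missing values.

```python
import pandas as pd

# Hypothetical stand-in data, not the random df from step 1
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Red'],
                     'integers': [1, 2, 3]})

stats = demo.describe(include='all')

# 'mean' does not apply to the string column, and 'top' does not apply
# to the numeric column, so both cells hold NaN
mean_of_color = stats.loc['mean', 'color']
top_of_integers = stats.loc['top', 'integers']

print(pd.isna(mean_of_color), pd.isna(top_of_integers))
```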


## Selecting Data
6. You can select a column using bracket syntax very similar to that used with dictionaries. Put the column name, as a string, in brackets after the DataFrame name. Try this with the column 'color'.

In [8]:
display(df.head())
df['color']

Unnamed: 0,color,integers,floats
0,Blue,1,0.06697
1,Blue,10,0.110125
2,Green,8,0.627695
3,Blue,6,0.707196
4,Red,12,0.508152


0      Blue
1      Blue
2     Green
3      Blue
4       Red
      ...  
95      Red
96     Blue
97    Green
98     Blue
99      Red
Name: color, Length: 100, dtype: object

7. Try selecting the columns 'color' and 'floats' by supplying them as a list of strings in the same bracket syntax.

In [9]:
df[['color', 'floats']].head(10)

Unnamed: 0,color,floats
0,Blue,0.06697
1,Blue,0.110125
2,Green,0.627695
3,Blue,0.707196
4,Red,0.508152
5,Red,0.60865
6,Green,0.344429
7,Red,0.155338
8,Red,0.906619
9,Green,0.215086
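
One detail worth noticing is the type of the result: a single column name returns a Series, while a list of names returns a DataFrame, even when the list has only one entry. A minimal sketch, using a made-up frame `demo`:

```python
import pandas as pd

# Hypothetical stand-in data, not the random df from step 1
demo = pd.DataFrame({'color': ['Red', 'Blue'],
                     'integers': [1, 2],
                     'floats': [0.5, 0.25]})

as_series = demo['color']             # single label -> Series
as_frame = demo[['color']]            # one-item list -> one-column DataFrame
two_cols = demo[['floats', 'color']]  # columns come back in the order listed

print(type(as_series).__name__, type(as_frame).__name__)
```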


8. The bracket syntax in DataFrames is overloaded to select rows as well. Selecting rows uses the same syntax used to slice sequences: a start number, a colon, and an upper bound number, where the upper bound is excluded. Try selecting three rows from the DataFrame using the slice `10:13`.

In [10]:
df[10:13]

Unnamed: 0,color,integers,floats
10,Green,7,0.681549
11,Green,3,0.756795
12,Green,6,0.070804
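
A plain integer slice in brackets counts row positions, not row labels. The difference is invisible in `df` because its labels equal its positions, so the sketch below uses a made-up frame `demo` with string labels to make it visible.

```python
import pandas as pd

# Hypothetical stand-in data with string row labels
demo = pd.DataFrame({'integers': [10, 20, 30, 40]},
                    index=['a', 'b', 'c', 'd'])

# Positions 1 and 2 are returned; position 3 (the upper bound) is excluded
rows = demo[1:3]
print(rows)
```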


9. Now let's try the `.loc[]` syntax. It also uses bracket syntax, but in this case you will specify both rows and columns to select. Select all of the rows by supplying a lone colon as the first argument, and the column 'color' by supplying it as a second argument (remember that arguments must be separated by a comma).

In [11]:
df.loc[:, 'color']

0      Blue
1      Blue
2     Green
3      Blue
4       Red
      ...  
95      Red
96     Blue
97    Green
98     Blue
99      Red
Name: color, Length: 100, dtype: object

10. Now specify a slice, `10:13`, for the first argument and a list of columns, `['color', 'integers']`, as a second, to select **four** rows (the upper bound in `loc[]` is included) and two columns.

In [12]:
df.loc[10:13, ['color', 'integers']]

Unnamed: 0,color,integers
10,Green,7
11,Green,3
12,Green,6
13,Green,13
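
Because `loc[]` works on labels, the same inclusive slicing applies to non-numeric labels. A minimal sketch, assuming a made-up frame `demo` whose rows are labelled with letters:

```python
import pandas as pd

# Hypothetical stand-in data with string row labels
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
                     'integers': [1, 2, 3, 4, 5],
                     'floats': [0.1, 0.2, 0.3, 0.4, 0.5]},
                    index=['a', 'b', 'c', 'd', 'e'])

# Label slice 'b':'d' includes both endpoints, so three rows come back
subset = demo.loc['b':'d', ['color', 'integers']]
print(subset)
```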


11. Now try the `iloc[]` syntax. This uses the position of rows and columns to determine selection. In this DataFrame, the labels for the rows are the same as their position, so we can use the same slice `10:13` as the first argument. For the second, use the slice `0:2` to select the first two columns. Notice that with `iloc[]`, the upper bound is not inclusive, so you will get three rows and two columns.

In [13]:
df.iloc[10:13, 0:2]

Unnamed: 0,color,integers
10,Green,7
11,Green,3
12,Green,6
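
When row labels differ from row positions, `loc[]` and `iloc[]` stop being interchangeable. The sketch below puts the two side by side on a made-up frame `demo` whose labels start at 100 (an assumption chosen only to separate labels from positions).

```python
import pandas as pd

# Hypothetical stand-in data: labels 100..104 sit at positions 0..4
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
                     'integers': [1, 2, 3, 4, 5]},
                    index=[100, 101, 102, 103, 104])

# loc: label-based, upper bound included -> labels 101, 102, 103
by_label = demo.loc[101:103, ['color', 'integers']]

# iloc: position-based, upper bound excluded -> positions 1 and 2
by_position = demo.iloc[1:3, 0:2]

print(len(by_label), len(by_position))
```

In this frame, `demo.iloc[101:103]` would return an empty frame, since there are no rows at those positions.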
