# Assessing and Building Intuition
Once your data is loaded into a dataframe, Pandas makes a quick investigation of it really easy. Let's explore some helpful methods for assessing a dataset and building intuition about it, using the cancer data from before.

In [1]:
import pandas as pd

df = pd.read_csv('cancer_data.csv')
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,radius_max,texture_max,perimeter_max,area_max,smoothness_max,compactness_max,concavity_max,concave_points_max,symmetry_max,fractal_dimension_max
0,842302,M,17.99,,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [2]:
# this returns a tuple of the dimensions of the dataframe
df.shape

(569, 32)

In [3]:
# this returns the datatypes of the columns
df.dtypes

id                          int64
diagnosis                  object
radius_mean               float64
texture_mean              float64
perimeter_mean            float64
area_mean                 float64
smoothness_mean           float64
compactness_mean          float64
concavity_mean            float64
concave_points_mean       float64
symmetry_mean             float64
fractal_dimension_mean    float64
radius_SE                 float64
texture_SE                float64
perimeter_SE              float64
area_SE                   float64
smoothness_SE             float64
compactness_SE            float64
concavity_SE              float64
concave_points_SE         float64
symmetry_SE               float64
fractal_dimension_SE      float64
radius_max                float64
texture_max               float64
perimeter_max             float64
area_max                  float64
smoothness_max            float64
compactness_max           float64
concavity_max             float64
concave_points

In [4]:
# although the datatype for diagnosis appears to be object, further
# investigation shows it's a string
type(df['diagnosis'][0])

str

Pandas actually stores [pointers](https://en.wikipedia.org/wiki/Pointer_(computer_programming)) to strings in dataframes and series, which is why `object` instead of `str` appears as the datatype. Understanding this is not essential for data analysis - just know that strings will appear as objects in Pandas.
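As a minimal sketch of this behavior, the example below builds a small, made-up dataframe (the name `toy` and its values are illustrative, not taken from the cancer dataset) and compares the column's dtype with the Python type of a single value. It requests `dtype=object` explicitly, on the assumption that some newer Pandas versions may pick a dedicated string dtype by default.

```python
import pandas as pd

# Hypothetical toy data, not taken from cancer_data.csv
toy = pd.DataFrame({'diagnosis': pd.Series(['M', 'B', 'M'], dtype=object)})

# The column-level dtype is reported as object...
print(toy['diagnosis'].dtype)

# ...while each individual element is an ordinary Python string
print(type(toy['diagnosis'].iloc[0]))
```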

In [5]:
# this displays a concise summary of the dataframe,
# including the number of non-null values in each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
id                        569 non-null int64
diagnosis                 569 non-null object
radius_mean               569 non-null float64
texture_mean              548 non-null float64
perimeter_mean            569 non-null float64
area_mean                 569 non-null float64
smoothness_mean           521 non-null float64
compactness_mean          569 non-null float64
concavity_mean            569 non-null float64
concave_points_mean       569 non-null float64
symmetry_mean             504 non-null float64
fractal_dimension_mean    569 non-null float64
radius_SE                 569 non-null float64
texture_SE                548 non-null float64
perimeter_SE              569 non-null float64
area_SE                   569 non-null float64
smoothness_SE             521 non-null float64
compactness_SE            569 non-null float64
concavity_SE              569 non-null float64
conca

In [6]:
# this returns the number of unique values in each column
df.nunique()

id                        562
diagnosis                   2
radius_mean               451
texture_mean              459
perimeter_mean            516
area_mean                 532
smoothness_mean           434
compactness_mean          530
concavity_mean            530
concave_points_mean       536
symmetry_mean             394
fractal_dimension_mean    494
radius_SE                 535
texture_SE                499
perimeter_SE              526
area_SE                   523
smoothness_SE             495
compactness_SE            534
concavity_SE              526
concave_points_SE         500
symmetry_SE               438
fractal_dimension_SE      539
radius_max                452
texture_max               490
perimeter_max             510
area_max                  537
smoothness_max            378
compactness_max           523
concavity_max             532
concave_points_max        485
symmetry_max              449
fractal_dimension_max     530
dtype: int64

In [7]:
# this returns useful descriptive statistics for each column of data
df.describe()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,...,radius_max,texture_max,perimeter_max,area_max,smoothness_max,compactness_max,concavity_max,concave_points_max,symmetry_max,fractal_dimension_max
count,569.0,569.0,548.0,569.0,569.0,521.0,569.0,569.0,569.0,504.0,...,569.0,548.0,569.0,569.0,521.0,569.0,569.0,569.0,504.0,569.0
mean,30514670.0,14.113021,19.293431,91.877909,653.288576,0.096087,0.104536,0.08862,0.048837,0.181091,...,16.261896,25.660803,107.211142,880.163796,0.13209,0.254557,0.271681,0.114377,0.288856,0.084012
std,125041700.0,3.506148,4.327287,24.162787,349.476899,0.013924,0.052674,0.079011,0.038578,0.027899,...,4.841175,6.202916,33.621975,570.498628,0.022685,0.158042,0.208298,0.06576,0.06252,0.018151
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,869104.0,11.7,16.1675,75.17,420.3,0.08605,0.06526,0.02956,0.02036,0.1618,...,13.01,21.0175,84.11,515.3,0.1166,0.146,0.1125,0.06402,0.24765,0.07127
50%,906024.0,13.37,18.785,86.34,551.1,0.09578,0.09453,0.06155,0.0337,0.17895,...,14.97,25.37,97.65,686.5,0.1312,0.2119,0.2267,0.1001,0.28065,0.08004
75%,8910251.0,15.78,21.825,103.8,782.7,0.1048,0.1305,0.1319,0.07404,0.19575,...,18.76,29.675,125.1,1070.0,0.145,0.3399,0.3853,0.1625,0.317525,0.09211
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [None]:
# this returns the first few lines in our dataframe
# by default, it returns the first five
df.head()

In [None]:
# although, you can specify however many rows you'd like returned
df.head(20)

In [None]:
# same thing applies to `.tail()` which returns the last few rows
df.tail(2)

## Indexing and Selecting Data in Pandas
Let's separate this dataframe into three new dataframes - one for each metric (mean, standard error, and maximum). To get the data for each dataframe, we need to select the `id` and `diagnosis` columns, as well as the ten columns for that metric.

In [None]:
# View the index number and label for each column
for i, v in enumerate(df.columns):
    print(i, v)

We can select data using `loc` and `iloc`, which you can read more about [here](https://pandas.pydata.org/pandas-docs/stable/indexing.html). `loc` selects data by the labels of rows or columns, while `iloc` selects it by integer position. We'll use these to index the dataframe below.
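One difference worth seeing in a small example is how the two treat the end of a slice: a label slice with `loc` includes its end label, while a positional slice with `iloc` excludes its stop position, as Python slicing usually does. The sketch below uses a made-up dataframe (`toy` and its values are illustrative only).

```python
import pandas as pd

# Hypothetical toy data with a few columns named like those in the cancer data
toy = pd.DataFrame({
    'id': [1, 2],
    'diagnosis': ['M', 'B'],
    'radius_mean': [1.0, 2.0],
    'texture_mean': [3.0, 4.0],
})

# loc slices by label and INCLUDES the end label
by_label = toy.loc[:, 'id':'radius_mean']

# iloc slices by position and EXCLUDES the stop position,
# so selecting positions 0, 1 and 2 requires a stop value of 3
by_position = toy.iloc[:, :3]

print(list(by_label.columns))
print(list(by_position.columns))
print(by_label.equals(by_position))
```

Both selections should contain the same three columns, which is a quick way to check that a positional slice matches the label slice it is meant to reproduce.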

In [8]:
# select all the columns from 'id' to the last mean column
df_means = df.loc[:,'id':'fractal_dimension_mean']
df_means.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,fractal_dimension_mean
0,842302,M,17.99,,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999
3,84348301,M,11.42,20.38,77.58,386.1,,0.2839,0.2414,0.1052,0.2597,0.09744
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883


In [9]:
# repeat the step above using index numbers
# (iloc excludes the stop position, so :12 selects positions 0 through 11)
df_means = df.iloc[:,:12]
df_means.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,fractal_dimension_mean
0,842302,M,17.99,,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999
3,84348301,M,11.42,20.38,77.58,386.1,,0.2839,0.2414,0.1052,0.2597,0.09744
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883


Let's save the dataframe of means for later.

In [10]:
df_means.to_csv('cancer_data_means.csv', index=False)
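The `index=False` argument keeps the row index out of the saved file. As a rough illustration of why that matters, the sketch below writes a made-up dataframe to in-memory buffers (standing in for files on disk; `toy` and the buffer names are hypothetical) and reads each one back.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for df_means
toy = pd.DataFrame({'id': [842302, 842517], 'diagnosis': ['M', 'M']})

# With the default index=True, the row index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)
print(cols_with_index)

# With index=False, only the actual data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
print(cols_without_index)
```

The first list picks up an extra `Unnamed: 0` column when the file is read back, while the second contains only `id` and `diagnosis`.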

### Selecting Multiple Ranges in Pandas
Selecting the columns for the mean dataframe was pretty straightforward - the columns we needed to select were all together (`id`, `diagnosis`, and the mean columns). Now we run into a little issue when we try to do the same for the standard errors or maximum values. `id` and `diagnosis` are separated from the rest of the columns we need! We can't specify all of these in one range.

First, try creating the standard error dataframe on your own to see why doing this with just `loc` and `iloc` is an issue. Then, use this [stackoverflow link](https://stackoverflow.com/questions/41256648/select-multiple-ranges-of-columns-in-pandas-dataframe) to learn how to select multiple ranges in Pandas and try it below. By the way, I found this link myself just by googling "how to select multiple ranges df.iloc".

*Hint: You may have to import a new package!*
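One common approach to this kind of problem is NumPy's `np.r_`, which concatenates several slices into a single array of positions that `iloc` accepts. The sketch below tries it on a made-up dataframe with hypothetical column names `a` through `f` rather than on the cancer data.

```python
import numpy as np
import pandas as pd

# Hypothetical one-row dataframe with six columns, a through f
toy = pd.DataFrame([[0, 1, 2, 3, 4, 5]], columns=list('abcdef'))

# np.r_ turns the two slices into one array of positions: 0, 1, 4, 5
positions = np.r_[:2, 4:6]
print(positions)

# iloc accepts that array, selecting two separate column ranges at once
subset = toy.iloc[:, positions]
print(list(subset.columns))
```

The same pattern should carry over to the real dataframe once you know the positions of the columns you want, which the `enumerate(df.columns)` loop from earlier prints out.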

In [None]:
import numpy as np


# create the standard errors dataframe
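# one possible solution: np.r_ joins positions 0-1 (id, diagnosis)
# with positions 12-21 (the ten SE columns)
df_std = df.iloc[:, np.r_[:2, 12:22]]

# view the first few rows to confirm this was successful
df_std.head()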
