# Overview

A comprehensive list of commands for initial exploration of data. Not all of them will work for all data. The examples use pandas dataframes, which I like to use, and probably won't translate directly to other data structures.

Each entry lists the command, what its output should be, what it's good for, its drawbacks, and an example. The data in the cell below must be loaded before running any of the examples.

If the import below fails, make sure that pydataset is installed.

`$ pip install pydataset`

In [2]:
from pydataset import data

titanic = data('titanic')
print(titanic.columns)

Index(['class', 'age', 'sex', 'survived'], dtype='object')


## .shape

**Output**:  
The dimensions of the dataframe as a tuple: (number of rows, number of columns).

**What it's Good For**:  
Initial sizing: seeing how big the dataset is going to be.

**Cons**:  
None.

In [None]:
print("Data's size: \n", titanic.shape)
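Because `.shape` is a plain tuple, it can be unpacked directly. The sketch below uses a small hand-built dataframe (`toy` is a hypothetical stand-in, not the real titanic data) so it runs without pydataset.

```python
import pandas as pd

# Hypothetical stand-in for the titanic data: four rows, same four columns.
toy = pd.DataFrame({
    "class": ["1st class", "2nd class", "3rd class", "3rd class"],
    "age": ["adults", "adults", "child", "adults"],
    "sex": ["man", "women", "man", "women"],
    "survived": ["yes", "no", "no", "yes"],
})

# .shape is an attribute, not a method, so there are no parentheses.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")
```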

## .head( n=5 )

The easiest way to get a first look at the data.

**Output**:  
When no arguments are passed, returns the first five rows of the dataframe. If an integer ***n*** is passed as an argument, will return the first ***n*** rows.

**What it's Good For**:  
When you just want to see what the dataset looks like without being overwhelmed by the size. An easy way to see what possible values could be.

**Cons**:  
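Can be misleading because it returns only the first n rows. These rows do not represent the whole dataset. If the data is sorted or grouped, their values will likely be heavily skewed, and they should not be treated as a random sample. In the example below, the first ten rows are all identical.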

In [3]:
print("First five rows: \n", titanic.head())
print("\nFirst ten rows: \n", titanic.head(10))

First five rows: 
        class     age  sex survived
1  1st class  adults  man      yes
2  1st class  adults  man      yes
3  1st class  adults  man      yes
4  1st class  adults  man      yes
5  1st class  adults  man      yes

First ten rows: 
         class     age  sex survived
1   1st class  adults  man      yes
2   1st class  adults  man      yes
3   1st class  adults  man      yes
4   1st class  adults  man      yes
5   1st class  adults  man      yes
6   1st class  adults  man      yes
7   1st class  adults  man      yes
8   1st class  adults  man      yes
9   1st class  adults  man      yes
10  1st class  adults  man      yes
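One way around the skew of `.head()` is to pull a random sample instead. This is a minimal sketch on a hypothetical dataframe (`toy`, with made-up `row_id` and `group` columns), assuming that `DataFrame.sample` is acceptable for your use; `random_state` is only set so the result is repeatable.

```python
import pandas as pd

# Hypothetical sorted data: the first 50 rows are all group "a".
toy = pd.DataFrame({
    "row_id": range(100),
    "group": ["a"] * 50 + ["b"] * 50,
})

# head() only ever sees the start of the data.
first_rows = toy.head(5)

# sample() draws rows at random from the whole dataframe.
random_rows = toy.sample(n=5, random_state=0)

print(first_rows)
print(random_rows)
```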


## .columns

**Output**:  
The columns of the dataframe. These are usually considered the features.

**What it's Good For**:  
Quickly viewing the possible features of a dataset, especially when there are many. You can then select which features you want to keep and pare down the dataset for faster loading times.

**Cons**:  
It's possible that the dataset is a mess and the column labels are not representative of the data contained.

In [1]:
print("The columns of the data: \n", titanic.columns)

The columns of the data: 
 Index(['class', 'age', 'sex', 'survived'], dtype='object')
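Paring a dataset down to the features you care about can be sketched as follows. The dataframe `toy` and the list `keep` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in with the same column names as the titanic data.
toy = pd.DataFrame({
    "class": ["1st class", "3rd class"],
    "age": ["adults", "child"],
    "sex": ["man", "women"],
    "survived": ["yes", "no"],
})

# .columns is an Index; list() turns it into a plain Python list.
print(list(toy.columns))

# Passing a list of column names returns a dataframe with only those columns.
keep = ["class", "survived"]
subset = toy[keep]
print(subset.shape)
```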

## .isnull( ) / .notnull( )

**Output**:  
Returns a dataframe of booleans, the same shape as the original, indicating whether each value is null (`.isnull()`) or not null (`.notnull()`).

**What it's Good For**:  
`if` statements where you want to check whether something is null. Best for checking a single value.

**Cons**:  
If you want to see whether you have any null values, you would have to look through the entire output.

In [None]:
print("Entire dataset:\n", titanic.isnull())
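Rather than scanning the whole boolean dataframe by eye, the result can be aggregated. The sketch below uses a hypothetical dataframe (`toy`) with a few missing values planted in it. Chaining `.sum()` and `.any()` onto `.isnull()` is one common approach, not the only one.

```python
import pandas as pd

# Hypothetical data with some missing values (None becomes NaN in the float column).
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, None],
    "sex": ["man", "women", None, "man"],
    "survived": ["yes", "no", "no", "yes"],
})

# Count of nulls in each column (True counts as 1 when summed).
nulls_per_column = toy.isnull().sum()

# Single answer: is there a null anywhere in the dataframe?
has_any_null = toy.isnull().any().any()

# Only the rows that contain at least one null.
rows_with_null = toy[toy.isnull().any(axis=1)]

print(nulls_per_column)
print(has_any_null)
print(rows_with_null)
```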

## .describe( )
***.describe( percentiles=None, include=None, exclude=None )***

**Output**:  
Varies depending on the arguments provided and the data's values. Generally, descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset's distribution. Does not include NaN values!

**What it's Good For**:  
Getting an idea of the scale of the data, possible number of feature values, highest frequency values, etc. Can't expect to get everything, but it provides a little glimpse.

**Cons**:  
Output depends on input, so you can't be sure what information you will get in response. This can be seen in the example: every column in the titanic data holds text, so there are no numeric columns to summarize, and changing the parameters (such as the percentiles) does not change the output.

In [None]:
print("Describe: \n", titanic.describe())
print("\nDescribe with all columns: \n", titanic.describe(include='all'))
print("\nDescribe with different percentiles: \n", titanic.describe(percentiles=[.2, .4, .6, .8]))
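To see how the parameters change the output, it helps to have at least one numeric column. This is a sketch on a hypothetical dataframe (`toy`, with an invented `fare` column that is not part of the titanic data used here).

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
toy = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0],
    "class": ["1st", "3rd", "3rd", "3rd"],
})

# Default: only numeric columns are summarized (count, mean, std, min, quartiles, max).
numeric_summary = toy.describe()

# include='all' adds the text columns, reported as count, unique, top, and freq.
all_summary = toy.describe(include="all")

# Custom percentiles replace the default 25% and 75%; the median (50%) is always kept.
custom_summary = toy.describe(percentiles=[0.2, 0.8])

print(numeric_summary)
print(all_summary)
print(custom_summary)
```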

## .info( )
***.info(verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)***

**Output**:  
A summary of the dataframe.

**What it's Good For**:  
Checking to see if there are null objects that need to be cleaned, seeing datatypes of the values. In the example, we can see that all data features are objects, even though age sounds like it should be an int. This gives the clue that maybe we should examine this column more closely and make sure it represents what we *think* it represents.

**Cons**:  
None really that I can think of.

In [None]:
print("Initial info:")
titanic.info()  # info() prints its summary directly and returns None, so it is not wrapped in print()
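Note that `.info()` writes its summary straight to the screen and returns `None`. If you want the summary as a string (to log it, for example), one option is the `buf` parameter from the signature above. A minimal sketch on a hypothetical dataframe (`toy`):

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the titanic data.
toy = pd.DataFrame({
    "age": ["adults", "child"],
    "survived": ["yes", "no"],
})

# Prints the summary as a side effect; the return value is None.
returned = toy.info()

# Redirect the summary into an in-memory text buffer instead of the screen.
buffer = io.StringIO()
toy.info(buf=buffer)
summary_text = buffer.getvalue()

print(summary_text)
```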

## .sort_values( by )
***.sort_values( by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last' )***

**Output**:  
Sorts the data by the columns passed in **by**.

**What it's Good For**:  
Viewing the extremes of the dataset and visually finding the data you want more quickly. 

**Cons**:  
Can get surprisingly confusing, especially when sorting by several columns in different directions.

In [None]:
print("Sorted by class descending and age ascending: \n",
     titanic.sort_values(['class', 'age'], ascending=[False, True]))
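The mixed-direction sort is easier to follow on a tiny example. The dataframe `toy` and its `fare` column are hypothetical. With the default `inplace=False`, the original dataframe is left untouched and a sorted copy is returned.

```python
import pandas as pd

# Hypothetical data, deliberately out of order.
toy = pd.DataFrame({
    "class": ["1st", "3rd", "2nd", "3rd"],
    "fare": [80.0, 7.5, 20.0, 9.0],
})

# class descending first, then fare ascending within each class.
by_both = toy.sort_values(["class", "fare"], ascending=[False, True])

# The old row labels travel with the rows; reset them if you want 0..n-1 again.
renumbered = by_both.reset_index(drop=True)

print(by_both)
print(renumbered)
```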

## .groupby( )
***.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False)***

**Output**:  
Splits the dataset into groups based on the values in **by**, so that a computation can then be applied to each of these groups separately.

**What it's Good For**:  
When you want to analyze by category, i.e. "for each" questions. Use it to calculate simple things like sum, average, count, etc.

**Cons**:  
Returns a GroupBy object rather than a dataframe, so the result often has to be converted back into a dataframe if that's how you want to manipulate the data. Which is annoying.

In [None]:
print("Frequency of each survival outcome for all:\n",
      titanic['survived'].value_counts())

# value_counts() orders the counts within each class in descending order
print("\nThe freq of survival outcome for all grouped by class:\n",
      titanic.groupby('class')['survived'].value_counts())

# size() keeps the groups in sorted key order instead of ordering by count
print("\nSame as above in a different way:\n",
      titanic.groupby(['class', 'survived']).size())
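Getting back to a dataframe after grouping can be sketched as below. The dataframe `toy` is a hypothetical stand-in, and `reset_index` and `unstack` are two of several ways to do the conversion.

```python
import pandas as pd

# Hypothetical stand-in for the titanic data.
toy = pd.DataFrame({
    "class": ["1st", "1st", "3rd", "3rd", "3rd"],
    "survived": ["yes", "no", "no", "no", "yes"],
})

# size() gives a Series indexed by (class, survived).
counts = toy.groupby(["class", "survived"]).size()

# Long format: one row per (class, survived) pair, with the counts in a named column.
tidy = counts.reset_index(name="count")

# Wide format: one row per class, one column per survival outcome.
wide = counts.unstack(fill_value=0)

print(tidy)
print(wide)
```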
