# Review Data

Do an initial review of the data:
- combine data
- convert format (e.g. to DataFrame, dictionary, etc.)
- check and change variable types if necessary (e.g. int, float, etc.)
- get basic statistics (e.g. mean, min, max, etc.)

## 1. Combine Data
Many people combine the train and test datasets, do the manipulation on the combined data, and then separate them again afterwards.
<br>The Udemy course says not to combine before manipulating: with something like frequent category imputation, the most frequent category in the test set could be completely different from the one in the train set, so the value should be learned from the train set only and then applied to the test set.

I think data may be combined where there is not a sufficiently large test dataset.

NEED MORE INFO HERE
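
The concern about frequent category imputation can be sketched with a toy example. The two small frames and the column name `colour` are made up for illustration. The point is only that the most frequent category is learned from the train set and reused on the test set, and that the combined data can give a different answer.

```python
import pandas as pd

# Hypothetical toy data: 'colour' has missing values in both sets
train_df = pd.DataFrame({'colour': ['red', 'red', 'blue', None]})
test_df = pd.DataFrame({'colour': ['blue', 'blue', None]})

# Learn the most frequent category from the train set only
most_frequent = train_df['colour'].mode()[0]

# Apply the same learned value to both sets
train_filled = train_df['colour'].fillna(most_frequent)
test_filled = test_df['colour'].fillna(most_frequent)

# For comparison: the mode of the combined data is influenced by the test set
combined_mode = pd.concat([train_df, test_df])['colour'].mode()[0]

print(most_frequent, combined_mode)
```

Here the train set alone and the combined data disagree on the most frequent category, which is the situation the course warns about.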

In [None]:
import pandas as pd

# Combine train and test dataset (stack rows, then renumber the index from 0)
combined_df = pd.concat(objs=[train_df, test_df], axis=0, sort=False).reset_index(drop=True)

## 2. Format

The format can be changed to allow mathematical operations over the entire dataset. Format options are:
- list (square brackets) (e.g. ['hi', 1, [1, 2]])
- tuple (round brackets) (e.g. ('abcd', 786, 2.23, 'john', 70.2))
- dictionary (curly brackets with colons) (e.g. {'name': 'john'} or d = {'key1': 'item1', 'key2': 'item2'})
- pandas DataFrame

In [None]:
# check format
type(df)

# change format
# You can convert a list, NumPy array, or dictionary to a Series
my_list = [10, 20, 30]

# convert list to series (can also use pd.Series to convert dictionaries and arrays)
pd.Series(data=my_list)
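
A few more conversions between the formats listed above, as a minimal sketch. The names `records`, `people_df`, and `lookup` and their contents are made up for illustration.

```python
import pandas as pd

# Hypothetical example data
records = {'name': ['john', 'mary'], 'age': [30, 25]}

# dictionary of lists -> DataFrame (keys become column names)
people_df = pd.DataFrame(records)

# DataFrame column -> list
ages = people_df['age'].tolist()

# DataFrame -> dictionary of lists
back_to_dict = people_df.to_dict(orient='list')

# dictionary -> Series (keys become the index)
lookup = pd.Series({'key1': 'item1', 'key2': 'item2'})

print(type(people_df), ages, back_to_dict)
```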

## 3. Variable Type

Setting proper variable type:
- helps with memory management
- determines what values you can assign to it and what operations you can perform on it
- helps to select appropriate plots for visualization

Python and pandas have these variable types:
- numbers
    - INT (e.g. 5, -6) (enables mathematical operations)
    - LONG (e.g. 51924361L) (Python 2 only; in Python 3, int has unlimited precision and replaces long)
    - FLOAT (e.g. 15.20, 32.3e+18) (enables mathematical operations)
    - COMPLEX (e.g. 9.322e-36j)
- DATETIME (pandas dtype; enables date-based attributes and methods)
- CATEGORY (pandas dtype; usually uses less memory and runs faster when a column has few distinct values)
- BOOLEAN (True or False) (enables logical and mathematical operations)
- STRINGS (e.g. 'HelloWorld!') (can use single or double quotes) (quotes inside quotes: "wrap lot's of quotes")
    
A variable can be:
- CATEGORICAL - boolean, strings, category, date
    - nominal (no order, e.g. postcode), ordinal (can be meaningfully ordered, e.g. student grade)
- NUMERIC - int, float, complex
    - discrete (always a whole number, e.g. number of family members), continuous (any value within some range, e.g. house cost)
- MIXED - numbers and/or labels

If the dataset comes with a description of the columns, read it (or search the web to find out what each variable means) to decide whether the variable type listed by info() is appropriate or should be changed.

TODO: need more info here on when to change the type and on the variable types.
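
Two cases where changing the type tends to pay off, as a rough sketch on made-up data: a text column with few distinct values (category saves memory) and dates stored as text (datetime enables date attributes). The frame `df` and its contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: a text column with few distinct values, and dates stored as text
df = pd.DataFrame({
    'grade': ['A', 'B', 'C', 'B'] * 250,
    'timeStamp': ['2020-01-15 08:30:00'] * 1000,
})

# Memory used by the text column before and after converting to category
bytes_as_text = df['grade'].memory_usage(deep=True)
bytes_as_category = df['grade'].astype('category').memory_usage(deep=True)

# Converting text to datetime enables date-based attributes such as .dt.year
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
years = df['timeStamp'].dt.year

print(bytes_as_text, bytes_as_category, years.iloc[0])
```

The exact byte counts depend on the pandas version, so compare the two numbers rather than relying on specific values.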

In [None]:
# DETERMINE variable types of columns with
# if not many columns, can use
train.dtypes
# OR if many columns, use
# Get a Series object containing the data type objects of each column of Dataframe.
# Index of series is column name.
dataTypeSeries = train.dtypes
print('Data type of each column of Dataframe :')
print(dataTypeSeries.to_string())

# OR specific columns
type(df['timeStamp'].iloc[0])

In [None]:
# REVIEW variable values
for var in data.columns:
    print(var, data[var].unique()[0:20], '\n')

Make a summary of any columns that will be feature engineered (eg. mixed variable columns) AND any columns that will be changed.

Also, can do a summary of which columns are what type, etc.

In [None]:
# OPTIONAL!
# make list of variables  types

# numerical: discrete vs continuous
discrete = [var for var in data.columns if data[var].dtype!='O' and var!='survived' and data[var].nunique()<10]
continuous = [var for var in data.columns if data[var].dtype!='O' and var!='survived' and var not in discrete]

# mixed
mixed = ['cabin']

# categorical
categorical = [var for var in data.columns if data[var].dtype=='O' and var not in mixed]

print('There are {} discrete variables'.format(len(discrete)))
print('There are {} continuous variables'.format(len(continuous)))
print('There are {} categorical variables'.format(len(categorical)))
print('There are {} mixed variables'.format(len(mixed)))

# can list what is in each type of variable
# put each of the commands below in separate cell
discrete
continuous
categorical
mixed
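
An alternative to checking each dtype by hand is `select_dtypes`, sketched here on a made-up stand-in for `data`. The column names are assumptions, and `survived` is treated as the target as in the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the dataset used above
data = pd.DataFrame({
    'survived': [0, 1, 1],
    'age': [22.0, 38.0, 26.0],
    'sex': ['male', 'female', 'female'],
})

# Numeric columns, leaving out the target
numeric = [var for var in data.select_dtypes(include='number').columns if var != 'survived']

# Everything that is not numeric
non_numeric = list(data.select_dtypes(exclude='number').columns)

print(numeric, non_numeric)
```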

In [None]:
# IF NECESSARY, CHANGE to appropriate type, refer to above notes
# Not sure when to change yet? One reason is to save memory

# change datatype; the brackets on the left side create a new column or overwrite the existing one
dataset['column_name'] = dataset['column_name'].astype('data_type')

# Each type of integer has a different range of storage capacity
#  Type      Capacity
#  Int16 -- (-32,768 to +32,767)
#  Int32 -- (-2,147,483,648 to +2,147,483,647)
#  Int64 -- (-9,223,372,036,854,775,808 to +9,223,372,036,854,775,807)

# Cast all columns to int32
df.astype('int32').dtypes

# Cast a Series to int64
ser.astype('int64')

# Cast col1 to int32 using a dictionary
df.astype({'col1': 'int32'}).dtypes

# Cast a Series to category
ser.astype('category')
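
The integer ranges above can be looked up with NumPy, and pandas can pick the smallest integer type that fits. This is a small sketch; the Series `small` and `large` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Look up the range of an integer type instead of memorising it
int16_info = np.iinfo(np.int16)
print(int16_info.min, int16_info.max)

# Hypothetical columns
small = pd.Series([1, 2, 3])
large = pd.Series([1, 2, 40000])

# downcast='integer' picks the smallest signed integer type that holds every value
small_down = pd.to_numeric(small, downcast='integer')
large_down = pd.to_numeric(large, downcast='integer')

print(small_down.dtype, large_down.dtype)
```

The value 40000 does not fit in int16, so the second Series only goes down to int32.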

## 4. Get Basic Statistics

- look for correlations in features by using metadata
- is there a way to rate similarity
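
One common way to rate how similar two numeric features are is the Pearson correlation, which may or may not be what is needed here. A minimal sketch on made-up data (the frame `features` and its columns are assumptions):

```python
import pandas as pd

# Hypothetical numeric features
features = pd.DataFrame({
    'area': [50, 60, 70, 80],
    'rooms': [1, 2, 3, 4],
    'age': [40, 10, 30, 20],
})

# Pairwise Pearson correlation between numeric columns (values from -1 to 1)
corr = features.corr(numeric_only=True)

print(corr.round(2))
```

Values close to 1 or -1 suggest two features carry much the same information. Here `area` and `rooms` rise together exactly, so their correlation is 1.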

In [None]:
# Determine # rows & columns in train and test dataset
# This will help determine what columns may need to be added if train and test set are merged

# printing size and shape
print(f"Size of train - total number of elements = {train.size}\nShape of train - rows then columns = {train.shape}")
print()
print(f"Size of features - total number of elements = {features.size}\nShape of features - rows then columns = {features.shape}")

In [None]:
# head, gives first few rows of data (n= num rows)
df.head(n=5)

# tail
df.tail()

# describe gives, count, mean, std, etc.
df.describe()

# info gives variables, types, nulls, etc.
# change the type of a column if needed
df.info()

# We can grab information and arrays out of this dictionary to set up our data frame and understanding of the features
print(cancer['DESCR'])

Make a note of any observations from above summary statistics
- are there any inappropriate values
- outliers (too high and/ or low)
- list more here
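
One way to spot values that are too high or too low is the interquartile range (IQR) rule. This is a sketch on made-up numbers. The 1.5 multiplier is a common convention, not a fixed rule, and the right threshold depends on the data.

```python
import pandas as pd

# Hypothetical column with one suspiciously high value
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

# IQR rule: flag values far outside the middle 50% of the data
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```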

##### Possible Additional Summary Statistics

There may be specific questions that you want the data to answer.

In [None]:
# some datasets have a description
print(boston_dataset.DESCR)

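# get mean of the numeric columns, grouped by Company
# (numeric_only=True skips text columns, which would raise an error in recent pandas)
df.groupby('Company').mean(numeric_only=True)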

# can use: .std(), .min(), .max(), .count(), .transpose()

# What is the max Close price for each bank's stock throughout the time period?
# xs() function is used to get cross-section from the Series/DataFrame
# this dataframe had bank stock header then a stock info ticker
bank_stocks.xs(key='Close',axis=1,level='Stock Info').max()

# What is the highest amount of OvertimePay in the dataset ?
sal['OvertimePay'].max(axis=0)

# What is the job title of JOSEPH DRISCOLL ?
sal[sal['EmployeeName']=='JOSEPH DRISCOLL']['JobTitle']

# What is the name of highest paid person (including benefits)?
sal[sal['TotalPayBenefits']== sal['TotalPayBenefits'].max()] #['EmployeeName']

# mean BasePay for each year (select the column first so only BasePay is averaged)
sal.groupby('Year')['BasePay'].mean()

# What are the top 5 most common jobs?
sal['JobTitle'].value_counts().head(5)

# How many people have the word Chief in their job title?
def chief_string(title):
    return 'chief' in title.lower()

sum(sal['JobTitle'].apply(chief_string))

# % change for each bank
returns = pd.DataFrame()  # empty frame to collect one return column per ticker
for tick in tickers:
    returns[tick + ' Return'] = bank_stocks[tick]['Close'].pct_change()
returns.head()
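
What `pct_change` computes, shown on three made-up closing prices (the names `close` and `toy_returns` are for illustration only):

```python
import pandas as pd

# Hypothetical closing prices
close = pd.Series([100.0, 110.0, 99.0])

# pct_change compares each value with the previous one;
# the first row has no previous value, so it is NaN
toy_returns = close.pct_change()
print(toy_returns.tolist())
```

The second value is a 10% rise (100 to 110) and the third a 10% fall (110 to 99).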

- print("Dates go from day", min(train['Date']), "to day", max(train['Date']), ", a total of", train['Date'].nunique(), "days")
- print("Countries with Province/State informed: ", train.loc[train['Province_State']!='None']['Country_Region'].unique())