# Capstone Exercise
- Import the relevant Python libraries, including NumPy for numerical operations.
- Import the CSV file NSMES1988.csv into a dataframe.
- Inspect the data and report the details found: rows, columns, data types, etc. (use multiple functions).
- Find out whether the data is clean or has missing values.
- Comment on the data types, their values, and their range, specifically for the age and income columns.
- Export the data to a JSON file named NSMES1988.json, view it, and enter your comments.
- Perform a memory usage analysis and provide recommendations to reduce consumption.
- Apply the recommendations by changing data types.

In [1]:
import numpy as np 
import pandas as pd 


In [2]:
df = pd.read_csv('NSMES1988.csv')
df

Unnamed: 0.1,Unnamed: 0,visits,nvisits,ovisits,novisits,emergency,hospital,health,chronic,adl,region,age,afam,gender,married,school,income,employed,insurance,medicaid
0,1,5,0,0,0,0,1,average,2,normal,other,6.9,yes,male,yes,6,2.881000,yes,yes,no
1,2,1,0,2,0,2,0,average,2,normal,other,7.4,no,female,yes,10,2.747800,no,yes,no
2,3,13,0,0,0,3,3,poor,4,limited,other,6.6,yes,female,no,10,0.653200,no,no,yes
3,4,16,0,5,0,1,1,poor,2,limited,other,7.6,no,male,yes,3,0.658800,no,yes,no
4,5,3,0,0,0,0,0,average,2,limited,other,7.9,no,female,yes,6,0.658800,no,yes,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4401,4402,11,0,0,0,0,0,average,0,normal,other,8.4,no,female,yes,8,2.249700,no,yes,no
4402,4403,12,0,0,0,0,0,average,2,normal,other,7.8,no,female,no,11,5.813200,no,yes,no
4403,4404,10,0,20,0,1,1,average,5,normal,other,7.3,no,male,yes,12,3.877916,no,yes,no
4404,4405,16,1,0,0,0,0,average,0,normal,other,6.6,no,female,yes,12,3.877916,no,yes,no
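
The `Unnamed: 0` column in the output above looks like a row index that was saved into the CSV. A minimal sketch of one way to avoid loading it as data, assuming the first column really is just a saved index; the inline sample rows are illustrative stand-ins for the real file:

```python
import io
import pandas as pd

# A few rows in a layout similar to NSMES1988.csv (first column is the saved index).
sample_csv = io.StringIO(
    ",visits,health,age,income\n"
    "1,5,average,6.9,2.881\n"
    "2,1,average,7.4,2.7478\n"
    "3,13,poor,6.6,0.6532\n"
)

# index_col=0 tells pandas to use the first column as the index
# instead of loading it as a data column named "Unnamed: 0".
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

With the real file, the equivalent call would be `pd.read_csv('NSMES1988.csv', index_col=0)`.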


In [6]:
#describe
print(f'our data frame has {df.shape[0]} rows and {df.shape[1]} columns')

print(df.dtypes)


our data frame has 4406 rows and 20 columns
Unnamed: 0      int64
visits          int64
nvisits         int64
ovisits         int64
novisits        int64
emergency       int64
hospital        int64
health         object
chronic         int64
adl            object
region         object
age           float64
afam           object
gender         object
married        object
school          int64
income        float64
employed       object
insurance      object
medicaid       object
dtype: object


In [7]:
print(df.isna().sum())

Unnamed: 0    0
visits        0
nvisits       0
ovisits       0
novisits      0
emergency     0
hospital      0
health        0
chronic       0
adl           0
region        0
age           0
afam          0
gender        0
married       0
school        0
income        0
employed      0
insurance     0
medicaid      0
dtype: int64
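
The per-column counts above can be condensed into a small report that also shows percentages. This is a sketch on made-up data; `missing_report` is a hypothetical helper, not part of pandas:

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})

# Illustrative data with one gap per column (the real dataset has none).
demo = pd.DataFrame({
    "age": [6.9, np.nan, 6.6, 7.6],
    "health": ["average", "poor", None, "poor"],
})
report = missing_report(demo)
print(report)

# Single yes/no answer for the whole frame.
print("any missing:", bool(demo.isna().any().any()))
```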


In [9]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4406 entries, 0 to 4405
Data columns (total 20 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  4406 non-null   int64  
 1   visits      4406 non-null   int64  
 2   nvisits     4406 non-null   int64  
 3   ovisits     4406 non-null   int64  
 4   novisits    4406 non-null   int64  
 5   emergency   4406 non-null   int64  
 6   hospital    4406 non-null   int64  
 7   health      4406 non-null   object 
 8   chronic     4406 non-null   int64  
 9   adl         4406 non-null   object 
 10  region      4406 non-null   object 
 11  age         4406 non-null   float64
 12  afam        4406 non-null   object 
 13  gender      4406 non-null   object 
 14  married     4406 non-null   object 
 15  school      4406 non-null   int64  
 16  income      4406 non-null   float64
 17  employed    4406 non-null   object 
 18  insurance   4406 non-null   object 
 19  medicaid    4406 non-null   object 
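
`df.info()` reports one total; to see which columns contribute most, `DataFrame.memory_usage` gives a per-column breakdown. A small sketch on illustrative data (the column values are made up):

```python
import pandas as pd

demo = pd.DataFrame({
    "visits": [5, 1, 13, 16],
    "health": ["average", "average", "poor", "poor"],
    "age": [6.9, 7.4, 6.6, 7.6],
})

# deep=True inspects the stored strings themselves, so text columns
# report their real footprint rather than a shallow estimate.
per_column = demo.memory_usage(deep=True, index=False).sort_values(ascending=False)
print(per_column)
```

On the real dataframe the same call would be `df.memory_usage(deep=True)`; text columns are usually the ones worth looking at first.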

In [16]:
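# Age- and income-specific info
df['age'].describe()['max']  # largest age value (not displayed; only the last expression is shown)

# Range of age: max minus min
df['age'].max() - df['age'].min()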



4.300000000000001

In [22]:
print(f"the range of age is {round(df['age'].max() - df['age'].min(), 2)}")
print(f"the range of income is {round(df['income'].max() - df['income'].min(), 2)}")

the range of age is 4.3
the range of income is 55.85
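
Before choosing a smaller type, it can help to compare the observed range with what each candidate type can hold and to check for fractional values. A sketch on a handful of illustrative age values, not the full column:

```python
import numpy as np
import pandas as pd

age = pd.Series([6.9, 7.4, 6.6, 7.6, 7.9], name="age")

# describe() gives min, max, and quartiles in one call.
summary = age.describe()
value_range = summary["max"] - summary["min"]
print(f"range of age: {value_range:.2f}")

# Largest value each candidate type can hold.
print("float32 max:", np.finfo(np.float32).max)
print("int16 max:", np.iinfo(np.int16).max)

# An integer type keeps the values intact only if nothing has a fractional part.
has_fraction = bool((age % 1 != 0).any())
print("has fractional values:", has_fraction)
```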


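**Observations**
- No missing data in the dataframe
- There are 4,406 rows and 20 columns
    - 2 float64, 9 int64, and 9 object columns
    - age:
        - float64 data type
        - range: 4.3
    - income:
        - float64
        - range: 55.85
- Current memory usage: 2.4 MB

- To reduce memory, change the data types of age and income: their ranges are tiny compared with what float64 can hold. Both columns contain fractional values, so an integer type such as int16 truncates them (6.9 becomes 6), while a smaller float type such as float32 keeps the decimals.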
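
The trade-off between an integer and a smaller float type can be checked on a few values. A minimal sketch using rows like those shown earlier; the choice of `float32` is a suggestion, not something the exercise prescribes:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"age": [6.9, 7.4, 6.6], "income": [2.881, 2.7478, 0.6532]})

as_int = demo.astype("int16")      # drops the fractional part: 6.9 becomes 6
as_float = demo.astype("float32")  # keeps the decimals at reduced precision

print(as_int["age"].tolist())
print(np.allclose(as_float, demo, atol=1e-6))

# Bytes used by the two columns before and after the float32 cast.
bytes_before = demo.memory_usage(index=False).sum()
bytes_after = as_float.memory_usage(index=False).sum()
print(bytes_before, bytes_after)
```

Both casts save memory, but only `float32` preserves the values closely enough for later analysis.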



In [23]:
#export to json 
df.to_json("NSMES1988.json")
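
By default `to_json` writes one JSON object per column. A sketch of the row-oriented alternative and a round-trip check, using an in-memory string and illustrative rows instead of the real file:

```python
import io
import json
import pandas as pd

demo = pd.DataFrame({"visits": [5, 1], "health": ["average", "poor"], "age": [6.9, 7.4]})

# orient="records" writes one JSON object per row, which is often easier to read.
records_json = demo.to_json(orient="records")
first_row = json.loads(records_json)[0]
print(first_row)

# Read the JSON back to confirm that the shape and values survive the round trip.
restored = pd.read_json(io.StringIO(records_json), orient="records")
print(restored.shape)
```

For the real export, passing `orient="records"` and `indent=2` to `df.to_json("NSMES1988.json", ...)` makes the file easier to inspect by eye, at the cost of a larger file.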

In [None]:
listcol = ['age', 'income']

# Note: int16 drops the fractional part (e.g. 6.9 becomes 6)
for col in listcol:
    df[col] = df[col].astype('int16')

In [33]:
df4 = df.copy()  # copy so the original df is not modified
df4[['age','income']] = df4[['age','income']].astype('int16')
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4406 entries, 0 to 4405
Data columns (total 20 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  4406 non-null   int64 
 1   visits      4406 non-null   int64 
 2   nvisits     4406 non-null   int64 
 3   ovisits     4406 non-null   int64 
 4   novisits    4406 non-null   int64 
 5   emergency   4406 non-null   int64 
 6   hospital    4406 non-null   int64 
 7   health      4406 non-null   object
 8   chronic     4406 non-null   int64 
 9   adl         4406 non-null   object
 10  region      4406 non-null   object
 11  age         4406 non-null   int16 
 12  afam        4406 non-null   object
 13  gender      4406 non-null   object
 14  married     4406 non-null   object
 15  school      4406 non-null   int64 
 16  income      4406 non-null   int16 
 17  employed    4406 non-null   object
 18  insurance   4406 non-null   object
 19  medicaid    4406 non-null   object
dtypes: int16(2), int64(9), object(9)

In [30]:
# Perform memory information analysis and apply the recommended data type changes
df2 = df.copy()  # copy so the original df is not modified
df2['age'] = df2['age'].astype('int16')
df2['income'] = df2['income'].astype('int16')

df2.info(memory_usage='deep')



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4406 entries, 0 to 4405
Data columns (total 20 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  4406 non-null   int64 
 1   visits      4406 non-null   int64 
 2   nvisits     4406 non-null   int64 
 3   ovisits     4406 non-null   int64 
 4   novisits    4406 non-null   int64 
 5   emergency   4406 non-null   int64 
 6   hospital    4406 non-null   int64 
 7   health      4406 non-null   object
 8   chronic     4406 non-null   int64 
 9   adl         4406 non-null   object
 10  region      4406 non-null   object
 11  age         4406 non-null   int16 
 12  afam        4406 non-null   object
 13  gender      4406 non-null   object
 14  married     4406 non-null   object
 15  school      4406 non-null   int64 
 16  income      4406 non-null   int16 
 17  employed    4406 non-null   object
 18  insurance   4406 non-null   object
 19  medicaid    4406 non-null   object
dtypes: int16(2), int64(9), object(9)
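
The outputs above change only age and income, yet the int64 and object columns also take up memory. The following is a sketch of a broader pass, assuming the text columns hold a few repeated labels; `shrink` is a hypothetical helper and the data is illustrative:

```python
import pandas as pd

def shrink(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with smaller dtypes; the input frame is left untouched."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest integer type that still holds every value.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 at the smallest, so decimals are kept.
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif out[col].nunique() < len(out) / 2:
            # Few distinct labels repeated many times: store as category codes.
            out[col] = out[col].astype("category")
    return out

demo = pd.DataFrame({
    "visits": [5, 1, 13, 16] * 50,
    "health": ["average", "average", "poor", "poor"] * 50,
    "age": [6.9, 7.4, 6.6, 7.6] * 50,
})
small = shrink(demo)
before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Working on a copy keeps the original dataframe available for comparison, which a plain assignment such as `df2 = df` does not.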

In [None]:
df3 = df.copy()  # copy so the original df is not modified

