# Fundamental or Basic Understanding of Our Data

### Whenever we receive a new dataset, we start by asking seven basic questions about it. This is not a rigid rule, but it gives the first look at the data a consistent structure.

### The questions are:
1. How big is the data?
2. What does the data look like?
3. What are the data types of the columns?
4. Are there any missing values?
5. How does the data look mathematically?
6. Are there any duplicate values?
7. How are the columns correlated?

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('train.csv')

# 1. How big is the data?

In [3]:
df.shape

(891, 12)
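
Beyond the row and column counts, it can help to look at the number of cells and the memory footprint. The following is a minimal sketch on a small stand-in frame; `demo` and its values are illustrative and are not the notebook's `df`.

```python
import pandas as pd

# Small stand-in frame (illustrative); the notebook's df comes from train.csv.
demo = pd.DataFrame({"PassengerId": [1, 2, 3], "Fare": [7.25, 71.2833, 7.925]})

n_rows, n_cols = demo.shape                      # shape is a (rows, columns) tuple
n_cells = demo.size                              # total cells = rows * columns
mem_bytes = demo.memory_usage(deep=True).sum()   # bytes used, index included

print(f"{n_rows} rows x {n_cols} columns, {n_cells} cells, {mem_bytes} bytes")
```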

# 2. What does the data look like?

In [4]:
# Show the first 5 rows
df.head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


###### The first rows of a dataset are not always representative: a file can follow one format at the start and a different one further down. To get a less biased view, it is better to also look at a random sample of rows.


In [5]:
# Show 5 randomly chosen rows
df.sample(5)

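| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 373 | 374 | 0 | 1 | Ringhini, Mr. Sante | male | 22.0 | 0 | 0 | PC 17760 | 135.6333 | NaN | C |
| 57 | 58 | 0 | 3 | Novel, Mr. Mansouer | male | 28.5 | 0 | 0 | 2697 | 7.2292 | NaN | C |
| 481 | 482 | 0 | 2 | Frost, Mr. Anthony Wood "Archie" | male | NaN | 0 | 0 | 239854 | 0.0 | NaN | S |
| 482 | 483 | 0 | 3 | Rouse, Mr. Richard Henry | male | 50.0 | 0 | 0 | A/5 3594 | 8.05 | NaN | S |
| 157 | 158 | 0 | 3 | Corn, Mr. Harry | male | 30.0 | 0 | 0 | SOTON/OQ 392090 | 8.05 | NaN | S |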
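
Because `sample` draws different rows on every run, a shared notebook can show different output each time. A minimal sketch of making the draw repeatable, assuming a small illustrative frame `demo` (not the notebook's `df`) and an arbitrary seed of 42:

```python
import pandas as pd

# Illustrative stand-in data; values are made up for the example.
demo = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
})

# random_state fixes the seed, so the same rows come back on every run.
first = demo.sample(5, random_state=42)
second = demo.sample(5, random_state=42)

# frac draws a proportion of the rows instead of a fixed count.
part_of_rows = demo.sample(frac=0.3, random_state=42)
```

With a fixed `random_state`, `first` and `second` contain the same rows in the same order.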


# 3. What are the data types of the columns?

In [6]:
# Show the data type and non-null count of every column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


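###### The output above also shows the memory usage. Some columns use a wider type than they need: Age, for example, is stored as float64 although most ages are whole numbers. Note that Age contains missing values and a few fractional values (its minimum is 0.42), so it has to be cleaned before it can become an integer column. When the data and its memory usage are large, it is worth changing such types.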

In [7]:
#handle them
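
One possible way to fill in this step is to downcast columns to narrower types and compare memory before and after. This is a sketch under assumptions, not the notebook's own solution: `demo` is a small illustrative frame, and the choice of `int8` and `float32` is only an example (`float32` trades some precision for size).

```python
import numpy as np
import pandas as pd

# Illustrative stand-in data with the same kinds of columns as the notebook's df.
demo = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, np.nan, 0.42],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

before = demo.memory_usage(deep=True).sum()

slim = demo.copy()
# Small-range integer columns fit in int8 (-128 to 127).
slim["Survived"] = slim["Survived"].astype("int8")
slim["Pclass"] = slim["Pclass"].astype("int8")
# Age has NaN and fractional values, so it stays a float but at half the width.
slim["Age"] = slim["Age"].astype("float32")
slim["Fare"] = slim["Fare"].astype("float32")

after = slim.memory_usage(deep=True).sum()
print(before, "->", after)
```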

# 4. Are there any missing values?

###### .info() already shows whether a column has missing values, through its non-null count. When we need the exact number of missing values per column, it is better to use .isnull().sum().

In [8]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

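###### Once we know which columns have missing values and how many, we can decide how to handle each one. Cabin, for example, is missing in 687 of 891 rows, while Embarked is missing in only 2.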
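
As a rough illustration of how the missing counts can drive the choice, the sketch below computes the share of missing values per column and then applies three common options. The frame `demo` is illustrative, and which strategy suits which column is an assumption for the example, not a rule.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in data; values are made up for the example.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Share of missing values per column, as a percentage.
missing_pct = demo.isnull().mean() * 100

# Mostly empty column: drop it entirely.
cleaned = demo.drop(columns=["Cabin"])
# Numeric column with some gaps: fill them with the median.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
# Column with very few gaps: drop the affected rows.
cleaned = cleaned.dropna(subset=["Embarked"])
```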

# 5. How does the data look mathematically?

In [9]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
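
By default `describe()` summarizes only the numeric columns, which is why Name, Sex, Ticket, Cabin and Embarked do not appear above. A minimal sketch of summarizing text columns and of requesting other percentiles, using an illustrative frame `demo` rather than the notebook's `df`:

```python
import pandas as pd

# Illustrative stand-in data; values are made up for the example.
demo = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# On text-only columns, describe() reports count, unique, top and freq.
text_summary = demo[["Sex", "Embarked"]].describe()

# Extra percentiles can be requested for a numeric summary.
fare_summary = demo["Fare"].describe(percentiles=[0.1, 0.9])

print(text_summary)
print(fare_summary)
```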


# 6. Are there duplicate values?

In [10]:
df.duplicated().sum()

0
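
This dataset has no duplicates, so there is nothing to clean here. For a dataset that does, a minimal sketch of inspecting and removing them could look like this; `demo` is an illustrative frame with one deliberately repeated row.

```python
import pandas as pd

# Illustrative stand-in data; the third row repeats the first on purpose.
demo = pd.DataFrame({
    "Name": ["Allen", "Braund", "Allen", "Corn"],
    "Ticket": ["373450", "A/5 21171", "373450", "SOTON/OQ 392090"],
})

# Count rows that are identical to an earlier row.
n_dupes = demo.duplicated().sum()
# keep=False marks every copy, including the first, which helps when inspecting.
dupe_rows = demo[demo.duplicated(keep=False)]
# drop_duplicates keeps the first occurrence of each row by default.
deduped = demo.drop_duplicates()
```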

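# 7. How are the columns correlated?

##### .corr() computes the Pearson correlation coefficient by default. It ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship); values near 0 indicate little linear relationship.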

In [11]:
# numeric_only=True (pandas 1.5+) skips text columns such as Name and Sex
df.corr(numeric_only=True)['Survived']

PassengerId   -0.005007
Survived       1.000000
Pclass        -0.338481
Age           -0.077221
SibSp         -0.035322
Parch          0.081629
Fare           0.257307
Name: Survived, dtype: float64
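
To make the Pearson coefficient less of a black box, here is a minimal sketch that computes it by hand for Survived and Fare on the five rows shown by `df.head(5)` and compares the result with pandas. The frame `demo` and the variable names are illustrative; with only five rows the value will differ from the full-dataset result above.

```python
import numpy as np
import pandas as pd

# Survived and Fare for the five rows shown by df.head(5).
demo = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

x = demo["Survived"].to_numpy(dtype=float)
y = demo["Fare"].to_numpy(dtype=float)

# Pearson r = sum((x - mean_x) * (y - mean_y))
#             / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Series.corr uses the Pearson method by default.
r_pandas = demo["Survived"].corr(demo["Fare"])

print(r_manual, r_pandas)
```

Both values should agree up to floating-point rounding.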