**What is EDA?**

- Exploratory Data Analysis (EDA) is the process of describing data with statistical and visualization techniques in order to bring its
  important aspects into focus for further analysis. It involves studying a data set to understand its key characteristics, uncover patterns,
  locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before more formal
  statistical analysis or modeling.

**EDA typically involves:**

1. Data cleaning and preprocessing
2. Summary statistics (mean, median, mode, etc.)
3. Data visualization (plots, charts, heatmaps, etc.)
4. Correlation analysis
5. Distribution analysis (normality, outliers, etc.)
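
As a rough illustration of how some of these steps map onto pandas calls, here is a minimal sketch on a small made-up table. The column names and values are assumptions invented for illustration, and plotting (step 3) is left out to keep the sketch free of extra dependencies.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, None, 190.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0, 90.0],
})

# Step 1, data cleaning: count missing values, then drop incomplete rows
missing_per_column = toy.isna().sum()
clean = toy.dropna()

# Step 2, summary statistics: count, mean, std, min, quartiles, max
summary = clean.describe()
median_height = clean["height_cm"].median()

# Step 4, correlation analysis (Pearson by default)
corr = clean.corr()

print(missing_per_column)
print(summary)
print(corr)
```

Dropping incomplete rows is only one possible cleaning choice; filling missing values is another, and which is appropriate depends on the data.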

**What is a DataFrame?**

- A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled rows and columns.

**What is Pandas?**

- Pandas is a Python library used for working with data sets. 

- It has functions for analyzing, cleaning, exploring, and manipulating data.

- The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

### Creating the DataFrame ###

**Empty DataFrame :**

In [4]:
import pandas as pd
pd.DataFrame()    #empty dataframe is created with no rows and columns

- A common way to create a DataFrame is from a dictionary. A dictionary is a collection of values linked to keys.
- Each key is separated from its value by a colon, and the whole dictionary is enclosed in curly braces. The dictionary keys become the column names
  of the DataFrame. For example, the key could be “Grades” and the values “A, B, C, D, F”.
- A DataFrame can also be built directly from lists, as the next cells do.
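
A minimal sketch of the "Grades" example just described. The variable names `grades` and `grades_df` are chosen here for illustration only.

```python
import pandas as pd

# The key becomes the column name; the list holds that column's values
grades = {"Grades": ["A", "B", "C", "D", "F"]}
grades_df = pd.DataFrame(grades)

print(grades_df)
```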

In [5]:
import pandas as pd

names = ['Jorge','Maria', 'Joe']

age = ['25','24', '28']

pd.DataFrame(names,age)

Unnamed: 0,0
25,Jorge
24,Maria
28,Joe


- Here the first positional argument is data
- and the second positional argument is index, so the values in age become the row labels rather than a second column
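
To make the role of the two positional arguments explicit, the sketch below repeats the call with keyword arguments and checks that both forms build the same DataFrame. The names `positional` and `keyword` are illustrative.

```python
import pandas as pd

names = ['Jorge', 'Maria', 'Joe']
age = ['25', '24', '28']

positional = pd.DataFrame(names, age)           # age is taken as the index
keyword = pd.DataFrame(data=names, index=age)   # the same call, spelled out

print(positional.equals(keyword))
print(list(keyword.index))
```

Spelling out data= and index= tends to make this kind of accidental index assignment easier to spot.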

**Create pandas dataframe from lists using zip**
- One way to create a Pandas DataFrame is with the zip() function. zip() pairs up the elements of the lists position by position and produces one tuple at a time. Passing the result to pd.DataFrame turns each tuple into a row, so the two lists are merged into a two-column DataFrame.

In [9]:
import pandas as pd

names = ['Jorge','Maria', 'Joe']
age = ['25','24', '28']

pd.DataFrame(zip(names,age))

Unnamed: 0,0,1
0,Jorge,25
1,Maria,24
2,Joe,28


**Provide the column names**

In [10]:
import pandas as pd

names = ['Jorge','Maria', 'Joe']
age = ['25','24', '28']

pd.DataFrame(zip(names,age),
             columns=['Names','Age'])

Unnamed: 0,Names,Age
0,Jorge,25
1,Maria,24
2,Joe,28


In [11]:
df = pd.DataFrame(zip(names,age),
             columns=['Names','Age'])  # store the DataFrame in df

df

Unnamed: 0,Names,Age
0,Jorge,25
1,Maria,24
2,Joe,28


In [12]:
type(df)

pandas.core.frame.DataFrame

**Add new column**

- Check the number of rows in the existing data.

- For example, the DataFrame above has 3 rows.

- Create a new list with the same number of values as there are rows (3 here), then include it when building the DataFrame or assign it to a new column of an existing DataFrame.

In [17]:
names = ['Jorge','Maria', 'Joe']
age = ['25','24', '28']
city = ['US', 'Canada', 'Spain'] # new column city

data= zip(names,age,city)
cols = ['Names','Age','city']

df=pd.DataFrame(data,columns=cols) 

df

Unnamed: 0,Names,Age,city
0,Jorge,25,US
1,Maria,24,Canada
2,Joe,28,Spain


**Create new column and update with new data**

In [19]:
job = ["Data scientist", 'Data Engineer', 'Cyber Security']  
df['job'] = job  # new column job is added 

df

Unnamed: 0,Names,Age,city,job
0,Jorge,25,US,Data scientist
1,Maria,24,Canada,Data Engineer
2,Joe,28,Spain,Cyber Security
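
The row-count requirement for a new column can be seen with a small sketch. It assumes a fresh three-row DataFrame named `df_demo` (a hypothetical name) and shows that a list of the wrong length is rejected.

```python
import pandas as pd

df_demo = pd.DataFrame({"Names": ["Jorge", "Maria", "Joe"]})

too_short = ["Data scientist", "Data Engineer"]   # only 2 values for 3 rows
error_raised = False
try:
    df_demo["job"] = too_short
except ValueError as err:
    error_raised = True
    print("Could not add column:", err)

# A list with one value per row is accepted
df_demo["job"] = ["Data scientist", "Data Engineer", "Cyber Security"]
print(df_demo)
```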


**How to change the index**

In [29]:
names = ['Jorge','Maria', 'Joe']
age = ['25','24', '28']
city = ['US', 'Canada', 'Spain'] # new column city

data= zip(names,age,city)
cols = ['Names','Age','city']

idx = ['A','B','C']      # one index label per row: number of values in list = number of rows

df=pd.DataFrame(data,index=idx,columns=cols) 
df

Unnamed: 0,Names,Age,city
A,Jorge,25,US
B,Maria,24,Canada
C,Joe,28,Spain


## Note ##
- Number of lists = **number of columns**

- Number of values inside list = **number of rows**

## Shape ##

- Number of Rows
- Number of Columns

In [24]:
df.shape

(3, 3)

In [None]:
# shape is a tuple of the form (rows, columns)
# it gives the number of rows and columns
# the df above has 3 rows and 3 columns

In [25]:
print("The number of rows is:",df.shape[0])
print("The number of columns is:",df.shape[1])

The number of rows is: 3
The number of columns is: 3
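
Because shape is an ordinary tuple, it can also be unpacked into two variables. The sketch below uses a small stand-in DataFrame named `demo` (an illustrative name) and compares shape with the related len() and size.

```python
import pandas as pd

demo = pd.DataFrame({"Names": ["Jorge", "Maria", "Joe"], "Age": [25, 24, 28]})

n_rows, n_cols = demo.shape   # unpack the (rows, columns) tuple
row_count = len(demo)         # len() gives the number of rows only
cell_count = demo.size        # size gives rows * columns

print(n_rows, n_cols, row_count, cell_count)
```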


## How to drop the columns ## 

- To drop a column we pass 3 arguments to drop():
  - the column name
  - axis: axis=0 **represents rows**, axis=1 **represents columns**
  - inplace: dropping a column modifies the df, so we choose whether to save the change in the same variable or a different one
  - inplace=True modifies the same variable (**inplace = same place**); with the default inplace=False, df is left unchanged and the result is returned, so it can be stored in another variable

In [26]:
df

Unnamed: 0,Names,Age,city
A,Jorge,25,US
B,Maria,24,Canada
C,Joe,28,Spain


In [30]:
df.drop('city',axis=1,inplace=True) # dropping city column
df

Unnamed: 0,Names,Age
A,Jorge,25
B,Maria,24
C,Joe,28


In [32]:
df1 = df.drop('Age',axis=1)
df1

Unnamed: 0,Names
A,Jorge
B,Maria
C,Joe


In [33]:
df2 = df.drop('C',axis=0) # drop row
df2

Unnamed: 0,Names,Age
A,Jorge,25
B,Maria,24
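
As an alternative to passing axis, drop() also accepts columns= and index= keywords. The sketch below, on a stand-in DataFrame named `demo` (an illustrative name), shows both keywords and what inplace=True returns.

```python
import pandas as pd

demo = pd.DataFrame(
    {"Names": ["Jorge", "Maria", "Joe"], "Age": ["25", "24", "28"]},
    index=["A", "B", "C"],
)

no_age = demo.drop(columns="Age")   # same effect as demo.drop("Age", axis=1)
no_c = demo.drop(index="C")         # same effect as demo.drop("C", axis=0)

# With inplace=True the method modifies demo itself and returns None
result = demo.drop(columns="Age", inplace=True)

print(result)
print(list(demo.columns))
```

Since the in-place call returns None, writing df = df.drop(..., inplace=True) would replace df with None, which is a common slip.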


## How to save Dataframe ##

**Write object to a comma-separated values (csv) file**

In [34]:
names = ['James','Maria', 'Mac','Jim']

age = ['28','30', '31','29']

city = ['US', 'Canada', 'Spain','Italy']

data= zip(names,age,city)
cols = ['Names','Age','city']
idx = ['A','B','C','D']      # one index label per row: number of values in list = number of rows

df=pd.DataFrame(data,index=idx,columns=cols) 
df

Unnamed: 0,Names,Age,city
A,James,28,US
B,Maria,30,Canada
C,Mac,31,Spain
D,Jim,29,Italy


In [38]:
df.to_csv("dataCSV.csv",index=False) 

In [41]:
df.to_excel("data2.xlsx",index=False) # write object to an Excel file (requires an Excel engine such as openpyxl)

## How to read a dataframe ##

- Need
    - file name and file location
    - the location can be omitted here because the Python file and the data file are in the same folder

In [39]:
pd.read_csv('dataCSV.csv')  # reading csv file

Unnamed: 0,Names,Age,city
0,James,28,US
1,Maria,30,Canada
2,Mac,31,Spain
3,Jim,29,Italy


In [42]:
pd.read_excel('data2.xlsx') # reading excel file

Unnamed: 0,Names,Age,city
0,James,28,US
1,Maria,30,Canada
2,Mac,31,Spain
3,Jim,29,Italy
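
The round trip above drops the A to D labels because the file was written with index=False. If the labels matter, one option is to write the index and read it back with index_col=0. The sketch below assumes a temporary folder and a hypothetical file name, demo.csv, so that it does not touch the files used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame(
    {"Names": ["James", "Maria"], "Age": [28, 30]},
    index=["A", "B"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "demo.csv"   # folder + file name (hypothetical)

    demo.to_csv(csv_path)                            # index is written by default
    restored = pd.read_csv(csv_path, index_col=0)    # first column becomes the index

print(restored)
```

Building the path with pathlib also shows how a file location is combined with a file name when the data file is not in the same folder as the notebook.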


### Create dataframe using dict

In [43]:
dict1 = {'Names':['Avinash','Akash','Adhya'],'Ages':[25,30,35]} # keys = column names, values = column data (one entry per row)
dictdf = pd.DataFrame(dict1)   # no need to provide column names separately
dictdf

Unnamed: 0,Names,Ages
0,Avinash,25
1,Akash,30
2,Adhya,35


In [44]:
dict2={'Name':'Avinash',
       'Age':30,
      'City':'Pune'}
pd.DataFrame(dict2,index=[1])

Unnamed: 0,Name,Age,City
1,Avinash,30,Pune


In [45]:
dict2={'Name':'Avinash',
       'Age':30,
      'City':'Pune'}
pd.DataFrame(dict2)  # fails: an index must be passed when all values are scalars

ValueError: If using all scalar values, you must pass an index
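
Besides passing an index, the error can be avoided by giving pandas something list-like. The sketch below shows two such options; the names `row_df` and `wrapped_df` are illustrative.

```python
import pandas as pd

dict2 = {'Name': 'Avinash', 'Age': 30, 'City': 'Pune'}

# Option 1: a list of dicts, where each dict is one row
row_df = pd.DataFrame([dict2])

# Option 2: wrap every scalar in a one-element list
wrapped_df = pd.DataFrame({key: [value] for key, value in dict2.items()})

print(row_df)
print(wrapped_df)
```

Both options produce a single row with the default index 0, whereas the index=[1] call above labels the row 1.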

**Import required packages**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as pl
import seaborn as sns

In [50]:
visadf=pd.read_csv(r"C:\Users\Anuja_PC\OneDrive\Documents\DataScience_NareshIT\dataFiles\Visadataset.csv")
visadf   # the visa dataset read from the CSV file

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV01,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,Denied
1,EZYV02,Asia,Master's,Y,N,2412,2002,Northeast,83425.6500,Year,Y,Certified
2,EZYV03,Asia,Bachelor's,N,Y,44444,2008,West,122996.8600,Year,Y,Denied
3,EZYV04,Asia,Bachelor's,N,N,98,1897,West,83434.0300,Year,Y,Denied
4,EZYV05,Africa,Master's,Y,N,1082,2005,South,149907.3900,Year,Y,Certified
...,...,...,...,...,...,...,...,...,...,...,...,...
25475,EZYV25476,Asia,Bachelor's,Y,Y,2601,2008,South,77092.5700,Year,Y,Certified
25476,EZYV25477,Asia,High School,Y,N,3274,2006,Northeast,279174.7900,Year,Y,Certified
25477,EZYV25478,Asia,Master's,Y,N,1121,1910,South,146298.8500,Year,N,Certified
25478,EZYV25479,Asia,Master's,Y,Y,1918,1887,West,86154.7700,Year,Y,Certified


**head**
- Displays the top 5 rows by default

In [52]:
visadf.head()

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
0,EZYV01,Asia,High School,N,N,14513,2007,West,592.2029,Hour,Y,Denied
1,EZYV02,Asia,Master's,Y,N,2412,2002,Northeast,83425.65,Year,Y,Certified
2,EZYV03,Asia,Bachelor's,N,Y,44444,2008,West,122996.86,Year,Y,Denied
3,EZYV04,Asia,Bachelor's,N,N,98,1897,West,83434.03,Year,Y,Denied
4,EZYV05,Africa,Master's,Y,N,1082,2005,South,149907.39,Year,Y,Certified


In [56]:
visadf.tail()   #last 5 records

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
25475,EZYV25476,Asia,Bachelor's,Y,Y,2601,2008,South,77092.57,Year,Y,Certified
25476,EZYV25477,Asia,High School,Y,N,3274,2006,Northeast,279174.79,Year,Y,Certified
25477,EZYV25478,Asia,Master's,Y,N,1121,1910,South,146298.85,Year,N,Certified
25478,EZYV25479,Asia,Master's,Y,Y,1918,1887,West,86154.77,Year,Y,Certified
25479,EZYV25480,Asia,Bachelor's,Y,N,3195,1960,Midwest,70876.91,Year,Y,Certified


In [57]:
visadf.tail(10)  # default is 5; passing another value returns that many rows

Unnamed: 0,case_id,continent,education_of_employee,has_job_experience,requires_job_training,no_of_employees,yr_of_estab,region_of_employment,prevailing_wage,unit_of_wage,full_time_position,case_status
25470,EZYV25471,North America,Master's,Y,N,2272,1970,Northeast,516.4101,Hour,Y,Certified
25471,EZYV25472,Asia,High School,N,N,40224,1962,Island,75587.42,Year,Y,Certified
25472,EZYV25473,Asia,High School,N,N,1346,2003,Midwest,76155.6,Year,N,Certified
25473,EZYV25474,Asia,Bachelor's,Y,N,2421,2007,Northeast,22845.56,Year,Y,Certified
25474,EZYV25475,Africa,Doctorate,N,N,2594,1979,Northeast,51104.78,Year,Y,Certified
25475,EZYV25476,Asia,Bachelor's,Y,Y,2601,2008,South,77092.57,Year,Y,Certified
25476,EZYV25477,Asia,High School,Y,N,3274,2006,Northeast,279174.79,Year,Y,Certified
25477,EZYV25478,Asia,Master's,Y,N,1121,1910,South,146298.85,Year,N,Certified
25478,EZYV25479,Asia,Master's,Y,Y,1918,1887,West,86154.77,Year,Y,Certified
25479,EZYV25480,Asia,Bachelor's,Y,N,3195,1960,Midwest,70876.91,Year,Y,Certified
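
The effect of the row-count argument is easier to check on a tiny table. The sketch below uses a made-up ten-row DataFrame named `numbers` (an assumption for illustration) rather than the visa dataset.

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(1, 11)})   # ten rows holding 1..10

first_three = numbers.head(3)         # first 3 rows
last_two = numbers.tail(2)            # last 2 rows
all_but_last_two = numbers.head(-2)   # negative n: everything except the last 2 rows

print(first_three["n"].tolist())
print(last_two["n"].tolist())
print(len(all_but_last_two))
```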


In [58]:
visadf.columns   # fetches the columns of dataframe

Index(['case_id', 'continent', 'education_of_employee', 'has_job_experience',
       'requires_job_training', 'no_of_employees', 'yr_of_estab',
       'region_of_employment', 'prevailing_wage', 'unit_of_wage',
       'full_time_position', 'case_status'],
      dtype='object')

In [59]:
visadf.dtypes # it gives the type of each column

case_id                   object
continent                 object
education_of_employee     object
has_job_experience        object
requires_job_training     object
no_of_employees            int64
yr_of_estab                int64
region_of_employment      object
prevailing_wage          float64
unit_of_wage              object
full_time_position        object
case_status               object
dtype: object
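
Checking dtypes matters because a column that looks numeric may be stored as text, as with the quoted ages in the earlier examples. The sketch below assumes a small stand-in DataFrame named `people` and shows one way to convert such a column with astype.

```python
import pandas as pd

people = pd.DataFrame({
    "Names": ["Jorge", "Maria", "Joe"],
    "Age": ["25", "24", "28"],        # stored as text, not numbers
})
print(people.dtypes)

# Convert the text column to integers so numeric operations work
people["Age"] = people["Age"].astype(int)
mean_age = people["Age"].mean()

# Keep only the numeric columns
numeric_only = people.select_dtypes(include="number")

print(people.dtypes)
print(list(numeric_only.columns))
```

astype(int) raises an error if any value cannot be parsed as an integer, so real data may need cleaning first.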