# DATA MANIPULATION USING PYTHON

## OBJECTIVE:

### 1. Consistent and organized data
### 2. Insightful project data access
### 3. More valuable data
### 4. Reduces unnecessary data points
### 5. Allows us to calculate the probability of a score occurring within our normal distribution


## 1. Dataset:

#### The Sleep Health and Lifestyle Dataset is described as comprising 400 rows and 13 columns; the copy loaded in this notebook contains 374 rows. It covers a wide range of variables related to sleep and daily habits, including gender, age, occupation, sleep duration, quality of sleep, physical activity level, stress level, BMI category, blood pressure, heart rate, daily steps, and the presence or absence of sleep disorders.

https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset

## 2. Get your data into a DataFrame:

##### Load a SLEEP DataFrame from a CSV file

###### Import the pandas library as 'pd' and the numpy library as 'np'
###### Import specific components Series and DataFrame from pandas (not necessary as we already imported pandas with 'pd')

In [95]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

###### Read data from the CSV file "Sleep_health.csv" into a DataFrame named 'frame' using the pandas function read_csv:

In [96]:
frame=pd.read_csv("Sleep_health.csv")
frame

Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
0,0,0,0,0,0,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
1,1,1,1,1,1,Male,28,Software Engineer,6.2,6,60,8,Normal,125/80,75,10000,
2,2,2,2,2,2,Male,28,Software Engineer,6.2,6,60,8,Normal,125/80,75,10000,
3,3,3,3,3,3,Male,28,Software Engineer,5.9,4,30,8,Obese,140/90,85,3000,
4,4,4,4,4,4,Male,28,Software Engineer,5.9,4,30,8,Obese,140/90,85,3000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,369,369,369,369,369,Female,59,Software Engineer,8.1,9,75,3,Overweight,140/95,68,7000,
370,370,370,370,370,370,Female,59,Software Engineer,8.0,9,75,3,Overweight,140/95,68,7000,
371,371,371,371,371,371,Female,59,Software Engineer,8.1,9,75,3,Overweight,140/95,68,7000,
372,372,372,372,372,372,Female,59,Software Engineer,8.1,9,75,3,Overweight,140/95,68,7000,


## 3. Saving a DataFrame

##### Saving a DataFrame to a CSV file

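###### Export the DataFrame 'frame' to a CSV file named "Sleep_health.csv". Because index=False is not passed, the row index is written as an extra column, so every save and reload adds another 'Unnamed: 0' column, as seen in the outputs of this notebook.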

In [97]:
frame.to_csv('Sleep_health.csv', encoding='utf-8')
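
The effect of writing the index can be checked without touching the real file. The sketch below uses a small made-up DataFrame (`sample` and its values are illustrative, not taken from the dataset) and an in-memory buffer in place of a CSV file on disk.

```python
import io
import pandas as pd

# Small stand-in for the sleep data (values are illustrative).
sample = pd.DataFrame({"Person ID": [1, 2], "Age": [27, 28]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out, so the columns survive a round trip.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'Person ID', 'Age']
print(list(reloaded_clean.columns))       # ['Person ID', 'Age']
```

Passing `index=False` when saving is usually the simpler choice when the index is just a default row counter.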

##### Saving a DataFrame to a Python dictionary

###### Assuming 'frame' is a pandas DataFrame containing data
###### Convert the DataFrame 'frame' to a dictionary

In [98]:
dictionary = frame.to_dict()
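
The shape of the resulting dictionary depends on the `orient` argument. A minimal sketch on a made-up two-row DataFrame (`sample` is hypothetical) compares the default layout with one dictionary per row:

```python
import pandas as pd

# Illustrative data, not taken from the dataset.
sample = pd.DataFrame({"Gender": ["Male", "Female"], "Age": [27, 59]})

# Default orientation: {column name -> {index label -> value}}
by_column = sample.to_dict()

# orient="records": a list with one dictionary per row
by_row = sample.to_dict(orient="records")

print(by_column)
print(by_row)
```

The `records` layout tends to be easier to work with when the rows are passed on to code that expects plain Python objects.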

##### Saving a DataFrame to a Python string

###### Convert the DataFrame 'frame' to a string representation

In [99]:
string = frame.to_string()

## 4. Working with the whole DataFrame

##### DataFrame contents/structure

###### Get a concise summary of the DataFrame 'frame'

In [100]:
frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0.3             374 non-null    int64  
 1   Unnamed: 0.2             374 non-null    int64  
 2   Unnamed: 0.1             374 non-null    int64  
 3   Unnamed: 0               374 non-null    int64  
 4   Person ID                374 non-null    int64  
 5   Gender                   374 non-null    object 
 6   Age                      374 non-null    int64  
 7   Occupation               374 non-null    object 
 8   Sleep Duration           374 non-null    float64
 9   Quality of Sleep         374 non-null    int64  
 10  Physical Activity Level  374 non-null    int64  
 11  Stress Level             374 non-null    int64  
 12  BMI Category             374 non-null    object 
 13  Blood Pressure           374 non-null    object 
 14  Heart Rate               3

###### The DataFrame has a total of 374 entries (rows).
###### The DataFrame has 17 columns.
###### Each column is listed with its index, column name, and the number of non-null values it contains.
###### The Dtype column specifies the data type of each column.
###### The memory usage is shown, which indicates the amount of memory consumed by the DataFrame.

##### Summary of column statistics

###### Get summary statistics of the DataFrame 'frame'

In [101]:
frame.describe()

Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Age,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,Heart Rate,Daily Steps,Sleep Disorder
count,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,0.0
mean,186.5,186.5,186.5,186.5,186.5,42.184492,7.132086,7.312834,59.171123,5.385027,70.165775,6816.84492,
std,108.108742,108.108742,108.108742,108.108742,108.108742,8.673133,0.795657,1.196956,20.830804,1.774526,4.135676,1617.915679,
min,0.0,0.0,0.0,0.0,0.0,27.0,5.8,4.0,30.0,3.0,65.0,3000.0,
25%,93.25,93.25,93.25,93.25,93.25,35.25,6.4,6.0,45.0,4.0,68.0,5600.0,
50%,186.5,186.5,186.5,186.5,186.5,43.0,7.2,7.0,60.0,5.0,70.0,7000.0,
75%,279.75,279.75,279.75,279.75,279.75,50.0,7.8,8.0,75.0,7.0,72.0,8000.0,
max,373.0,373.0,373.0,373.0,373.0,59.0,8.5,9.0,90.0,8.0,86.0,10000.0,


###### The first column represents the statistics being calculated.
###### The rows beneath each statistic represent the corresponding values for that statistic.
###### The statistics included in the output are:

###### count: the number of non-null values for each column.
###### mean: the mean (average) value of each column.
###### std: the standard deviation of each column.
###### min: the minimum value of each column.
###### 25%: the 25th percentile value of each column.
###### 50%: the 50th percentile value (median) of each column.
###### 75%: the 75th percentile value of each column.
###### max: the maximum value of each column.
###### The column names in the output indicate the features or variables present in the DataFrame. However, it seems that the last column, "Sleep Disorder," has missing values (NaN) for all entries, as indicated by "0.0" for count, which suggests that there is no data available for that column.

#### Head

###### The head() method in pandas displays the first few rows of a DataFrame. By default it shows the first 5 rows, but the number of rows can be passed as an argument. Here the argument is 3, so the method displays the first 3 rows of the DataFrame.

In [102]:
frame.head(3)

Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
0,0,0,0,0,0,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
1,1,1,1,1,1,Male,28,Software Engineer,6.2,6,60,8,Normal,125/80,75,10000,
2,2,2,2,2,2,Male,28,Software Engineer,6.2,6,60,8,Normal,125/80,75,10000,


###### Only the first 3 rows are displayed. Each row represents a person's information, with columns representing different attributes such as Person ID, Gender, Age, Occupation, Sleep Duration, Quality of Sleep, Physical Activity Level, Stress Level, BMI Category, Blood Pressure, Heart Rate, Daily Steps, and Sleep Disorder.

#### Tail

###### The method tail() is used to display the last few rows of the DataFrame. In this case, frame.tail(2) would display the last two rows of the DataFrame called frame.

In [103]:
frame.tail(2)

Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
372,372,372,372,372,372,Female,59,Software Engineer,8.1,9,75,3,Overweight,140/95,68,7000,
373,373,373,373,373,373,Female,59,Software Engineer,8.1,9,75,3,Overweight,140/95,68,7000,


###### Only the last 2 rows of the DataFrame are displayed, with the same columns as in the head() output.

##### Shape (row-count, column-count)
###### shape in the context of a DataFrame from Pandas (Python library) returns a tuple representing the dimensions of the DataFrame.

In [104]:
frame.shape

(374, 17)

###### The shape of the DataFrame is (374, 17), which means it has 374 rows and 17 columns. Each row represents an individual's information, and each column represents a specific attribute or feature.

## 5. Working with Rows 

##### Keeping rows

###### Filter the DataFrame 'frame' to get only the rows where the 'Age' column is 59

In [105]:
df = frame[frame['Age']== 59]
print (df)

     Unnamed: 0.3  Unnamed: 0.2  Unnamed: 0.1  Unnamed: 0  Person ID  Gender  \
358           358           358           358         358        358  Female   
359           359           359           359         359        359  Female   
360           360           360           360         360        360  Female   
361           361           361           361         361        361  Female   
362           362           362           362         362        362  Female   
363           363           363           363         363        363  Female   
364           364           364           364         364        364  Female   
365           365           365           365         365        365  Female   
366           366           366           366         366        366  Female   
367           367           367           367         367        367  Female   
368           368           368           368         368        368  Female   
369           369           369         

###### The resulting DataFrame 'df' contains only the rows that correspond to individuals with an age of 59.
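
A filter like this is a boolean mask applied to the DataFrame. The sketch below uses made-up rows (`people` is hypothetical) and adds `.copy()`, which is an addition to the notebook's approach: it makes the subset independent of the original, so later assignments to the subset do not raise a SettingWithCopyWarning.

```python
import pandas as pd

# Illustrative data, not taken from the dataset.
people = pd.DataFrame({
    "Age": [27, 59, 59, 43],
    "Sleep Duration": [6.1, 8.1, 8.0, 7.2],
})

# Boolean mask: True where the condition holds.
is_59 = people["Age"] == 59

# .copy() detaches the subset from `people`.
age_59 = people[is_59].copy()

# Adding a column to the copy leaves `people` untouched.
age_59["Rank"] = range(len(age_59))
print(age_59)
```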

##### Dropping rows

###### The DataFrame 'frame' is filtered to keep only the rows that satisfy all of the following conditions:

###### The 'Gender' column is not equal to 'Male'.
###### The 'Age' column is not equal to 59.
###### The 'Quality of Sleep' column is not equal to 9.
###### In other words, any row where the gender is 'Male', the age is 59, or the quality of sleep is 9 is dropped, and 'frame' is reassigned to the remaining rows.


In [106]:
frame = frame[(frame['Gender']!= 'Male') & (frame['Age']!= 59) & (frame['Quality of Sleep']!= 9)]
frame

Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
16,16,16,16,16,16,Female,29,Software Engineer,6.5,5,40,7,Normal Weight,132/87,80,4000,
18,18,18,18,18,18,Female,29,Software Engineer,6.5,5,40,7,Normal Weight,132/87,80,4000,
30,30,30,30,30,30,Female,30,Software Engineer,6.4,5,35,7,Normal Weight,130/86,78,4100,
31,31,31,31,31,31,Female,30,Software Engineer,6.4,5,35,7,Normal Weight,130/86,78,4100,
32,32,32,32,32,32,Female,31,Software Engineer,7.9,8,75,4,Normal Weight,117/76,69,6800,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307,307,307,307,307,307,Female,52,Software Engineer,6.5,7,45,7,Overweight,130/85,72,6000,
308,308,308,308,308,308,Female,52,Software Engineer,6.6,7,45,7,Overweight,130/85,72,6000,
309,309,309,309,309,309,Female,52,Software Engineer,6.6,7,45,7,Overweight,130/85,72,6000,
310,310,310,310,310,310,Female,52,Software Engineer,6.6,7,45,7,Overweight,130/85,72,6000,
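
Keeping rows where every "not equal" test passes is the same as dropping rows where any "equal" test passes. The sketch below checks that equivalence on a few made-up rows (`people` is hypothetical).

```python
import pandas as pd

# Illustrative data, not taken from the dataset.
people = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Female"],
    "Age": [27, 59, 43, 52],
    "Quality of Sleep": [6, 9, 9, 7],
})

# Keep rows where all three "not equal" tests pass ...
kept = people[(people["Gender"] != "Male")
              & (people["Age"] != 59)
              & (people["Quality of Sleep"] != 9)]

# ... which equals dropping rows where any "equal" test passes.
to_drop = ((people["Gender"] == "Male")
           | (people["Age"] == 59)
           | (people["Quality of Sleep"] == 9))
dropped = people[~to_drop]

print(kept.equals(dropped))  # True
```

Note the parentheses around each comparison: `&` and `|` bind more tightly than `!=` and `==` in Python, so they are required.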


##### null values (NaN)

###### The replace() method of the DataFrame 'frame' is used to replace the string value 'None' with NumPy's NaN (Not a Number) value. With inplace=True, the DataFrame 'frame' itself is modified.

In [107]:
# First replace the string 'None' with NaN:

frame.replace('None', np.nan,inplace=True)
frame

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  frame.replace('None', np.nan,inplace=True)


Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
16,16,16,16,16,16,Female,29,Software Engineer,6.5,5,40,7,Normal Weight,132/87,80,4000,
18,18,18,18,18,18,Female,29,Software Engineer,6.5,5,40,7,Normal Weight,132/87,80,4000,
30,30,30,30,30,30,Female,30,Software Engineer,6.4,5,35,7,Normal Weight,130/86,78,4100,
31,31,31,31,31,31,Female,30,Software Engineer,6.4,5,35,7,Normal Weight,130/86,78,4100,
32,32,32,32,32,32,Female,31,Software Engineer,7.9,8,75,4,Normal Weight,117/76,69,6800,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307,307,307,307,307,307,Female,52,Software Engineer,6.5,7,45,7,Overweight,130/85,72,6000,
308,308,308,308,308,308,Female,52,Software Engineer,6.6,7,45,7,Overweight,130/85,72,6000,
309,309,309,309,309,309,Female,52,Software Engineer,6.6,7,45,7,Overweight,130/85,72,6000,
310,310,310,310,310,310,Female,52,Software Engineer,6.6,7,45,7,Overweight,130/85,72,6000,
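
The warning above appears because 'frame' was created by filtering another DataFrame and is then modified in place. One way to avoid it, sketched here on made-up rows (`people`, `subset`, and `cleaned` are hypothetical names), is to take an explicit copy of the filtered rows and let replace() return a new DataFrame instead of using inplace=True.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the dataset.
people = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Sleep Disorder": ["None", "Insomnia", "None"],
})

# Explicit copy of the filtered rows.
subset = people[people["Gender"] == "Female"].copy()

# replace() without inplace=True returns a new DataFrame.
cleaned = subset.replace("None", np.nan)

print(cleaned)
```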


###### The isnull() method in pandas returns a boolean DataFrame of the same shape, in which each element is True if the corresponding element of the original DataFrame is NaN (missing) and False otherwise.

In [108]:
frame.isnull()

Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
16,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
18,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
30,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
31,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
32,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
308,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
309,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
310,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
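
The boolean mask is usually summarized rather than read cell by cell. A minimal sketch on made-up rows (`people` is hypothetical) counts missing values per column and selects the rows that contain at least one missing value:

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the dataset.
people = pd.DataFrame({
    "Age": [29, 30, 31],
    "Sleep Disorder": [np.nan, "Insomnia", np.nan],
})

# True counts as 1 when summed, so this gives missing values per column.
missing_per_column = people.isnull().sum()

# any(axis=1) is True for rows with at least one missing value.
rows_with_missing = people[people.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```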


##### Add row

In [109]:
frame1=pd.read_csv("Sleep_health.csv")
frame1

Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
0,0,0,0,0,0,0,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
1,1,1,1,1,1,1,Male,28,Software Engineer,6.2,6,60,8,Normal,125/80,75,10000,
2,2,2,2,2,2,2,Male,28,Software Engineer,6.2,6,60,8,Normal,125/80,75,10000,
3,3,3,3,3,3,3,Male,28,Software Engineer,5.9,4,30,8,Obese,140/90,85,3000,
4,4,4,4,4,4,4,Male,28,Software Engineer,5.9,4,30,8,Obese,140/90,85,3000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,369,369,369,369,369,369,Female,59,Software Engineer,8.1,9,75,3,Overweight,140/95,68,7000,
370,370,370,370,370,370,370,Female,59,Software Engineer,8.0,9,75,3,Overweight,140/95,68,7000,
371,371,371,371,371,371,371,Female,59,Software Engineer,8.1,9,75,3,Overweight,140/95,68,7000,
372,372,372,372,372,372,372,Female,59,Software Engineer,8.1,9,75,3,Overweight,140/95,68,7000,


###### Append a new row to the DataFrame with the loc indexer, using len(frame1.index) as the new row label, then display the updated DataFrame.
###### In the new row below, 'NaN' is a string, not a missing value. For pandas to treat it as missing, use np.nan from NumPy instead of the string 'NaN'.

In [128]:
frame1=pd.read_csv("Sleep_health.csv")
frame1.loc[len(frame1.index)]=[233,372,372,373,345,354,'Female',59,'Software Engineer',8.1,9,75,3,'Overweight','140/95',68,7000,'NaN']
frame1



Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
0,0,0,0,0,0,0,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
1,1,1,1,1,1,1,Male,28,Software Engineer,6.2,6,60,8,Normal,125/80,75,10000,
2,2,2,2,2,2,2,Male,28,Software Engineer,6.2,6,60,8,Normal,125/80,75,10000,
3,3,3,3,3,3,3,Male,28,Software Engineer,5.9,4,30,8,Obese,140/90,85,3000,
4,4,4,4,4,4,4,Male,28,Software Engineer,5.9,4,30,8,Obese,140/90,85,3000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
370,370,370,370,370,370,370,Female,59,Software Engineer,8.0,9,75,3,Overweight,140/95,68,7000,
371,371,371,371,371,371,371,Female,59,Software Engineer,8.1,9,75,3,Overweight,140/95,68,7000,
372,372,372,372,372,372,372,Female,59,Software Engineer,8.1,9,75,3,Overweight,140/95,68,7000,
373,373,373,373,373,373,373,Female,59,Software Engineer,8.1,9,75,3,Overweight,140/95,68,7000,
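
The difference between the string 'NaN' and a real missing value can be seen on a small example. The sketch below uses made-up rows (`people` is hypothetical) and appends a row whose last value is np.nan.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the dataset.
people = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Age": [27, 59],
    "Sleep Disorder": ["Insomnia", "Sleep Apnea"],
})

# With a default RangeIndex, len(people) is the next free row label.
people.loc[len(people)] = ["Female", 45, np.nan]

# The string "NaN" is ordinary text; np.nan is recognized as missing.
print(pd.isna("NaN"), pd.isna(np.nan))  # False True
print(people)
```

This pattern relies on the index being the default 0..n-1 range; with other index labels, len() may collide with an existing label and overwrite that row.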


## 6. Working with Columns

##### Get the column names

###### Use the .columns attribute of a DataFrame to obtain its column names. It returns an Index object containing the column labels.

In [111]:
# We can get the column names with .columns
frame.columns

Index(['Unnamed: 0.3', 'Unnamed: 0.2', 'Unnamed: 0.1', 'Unnamed: 0',
       'Person ID', 'Gender', 'Age', 'Occupation', 'Sleep Duration',
       'Quality of Sleep', 'Physical Activity Level', 'Stress Level',
       'BMI Category', 'Blood Pressure', 'Heart Rate', 'Daily Steps',
       'Sleep Disorder'],
      dtype='object')

###### The output lists the labels of all columns.

##### Get the data of one column

In [112]:
frame.Gender

16     Female
18     Female
30     Female
31     Female
32     Female
        ...  
307    Female
308    Female
309    Female
310    Female
311    Female
Name: Gender, Length: 116, dtype: object

###### This code will return the values stored in the 'Gender' column of the DataFrame.

##### specific data columns

###### create a new DataFrame containing specific columns ('Gender', 'Age', and 'Occupation') from the original DataFrame 'frame'.

In [113]:
#Lets see some specific data columns
DataFrame(frame,columns=['Gender','Age','Occupation'])


Unnamed: 0,Gender,Age,Occupation
16,Female,29,Software Engineer
18,Female,29,Software Engineer
30,Female,30,Software Engineer
31,Female,30,Software Engineer
32,Female,31,Software Engineer
...,...,...,...
307,Female,52,Software Engineer
308,Female,52,Software Engineer
309,Female,52,Software Engineer
310,Female,52,Software Engineer
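
Another common way to select several columns is to index the DataFrame with a list of names. A minimal sketch on made-up rows (`people` is hypothetical):

```python
import pandas as pd

# Illustrative data, not taken from the dataset.
people = pd.DataFrame({
    "Gender": ["Female", "Female"],
    "Age": [29, 30],
    "Occupation": ["Nurse", "Teacher"],
    "Heart Rate": [80, 78],
})

# A list of names inside the brackets returns a DataFrame.
subset = people[["Gender", "Age", "Occupation"]]

# A single name returns a Series. The bracket form also works for
# names that contain spaces, where attribute access would not.
heart_rate = people["Heart Rate"]

print(subset)
```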


###### The output contains only the selected columns.

##### Updating a whole column

###### Assigning a single value to frame['Occupation'] sets every value in the 'Occupation' column of the DataFrame 'frame' to "Software Engineer".

In [114]:
# Set every value in the 'Occupation' column
frame['Occupation']="Software Engineer"
frame

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  frame['Occupation']="Software Engineer"


Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
16,16,16,16,16,16,Female,29,Software Engineer,6.5,5,40,7,Normal Weight,132/87,80,4000,
18,18,18,18,18,18,Female,29,Software Engineer,6.5,5,40,7,Normal Weight,132/87,80,4000,
30,30,30,30,30,30,Female,30,Software Engineer,6.4,5,35,7,Normal Weight,130/86,78,4100,
31,31,31,31,31,31,Female,30,Software Engineer,6.4,5,35,7,Normal Weight,130/86,78,4100,
32,32,32,32,32,32,Female,31,Software Engineer,7.9,8,75,4,Normal Weight,117/76,69,6800,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307,307,307,307,307,307,Female,52,Software Engineer,6.5,7,45,7,Overweight,130/85,72,6000,
308,308,308,308,308,308,Female,52,Software Engineer,6.6,7,45,7,Overweight,130/85,72,6000,
309,309,309,309,309,309,Female,52,Software Engineer,6.6,7,45,7,Overweight,130/85,72,6000,
310,310,310,310,310,310,Female,52,Software Engineer,6.6,7,45,7,Overweight,130/85,72,6000,


###### In the output, every row now has "Software Engineer" in the 'Occupation' column.

###### The existing "Person ID" column of the DataFrame 'frame' is overwritten with a sequence of numbers generated by NumPy's arange() function.

In [115]:
# Assign sequential IDs to the 116 remaining rows
frame["Person ID"]= np.arange(116)
frame

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  frame["Person ID"]= np.arange(116)


Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
16,16,16,16,16,0,Female,29,Software Engineer,6.5,5,40,7,Normal Weight,132/87,80,4000,
18,18,18,18,18,1,Female,29,Software Engineer,6.5,5,40,7,Normal Weight,132/87,80,4000,
30,30,30,30,30,2,Female,30,Software Engineer,6.4,5,35,7,Normal Weight,130/86,78,4100,
31,31,31,31,31,3,Female,30,Software Engineer,6.4,5,35,7,Normal Weight,130/86,78,4100,
32,32,32,32,32,4,Female,31,Software Engineer,7.9,8,75,4,Normal Weight,117/76,69,6800,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307,307,307,307,307,111,Female,52,Software Engineer,6.5,7,45,7,Overweight,130/85,72,6000,
308,308,308,308,308,112,Female,52,Software Engineer,6.6,7,45,7,Overweight,130/85,72,6000,
309,309,309,309,309,113,Female,52,Software Engineer,6.6,7,45,7,Overweight,130/85,72,6000,
310,310,310,310,310,114,Female,52,Software Engineer,6.6,7,45,7,Overweight,130/85,72,6000,


###### The "Person ID" column of 'frame' now holds sequential numbers from 0 to 115. The range of numbers can be adjusted by changing the argument of the arange function, which must match the number of rows.
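
Hard-coding the row count breaks as soon as the filter changes. A small sketch, on a made-up DataFrame (`people` is hypothetical), derives the length from the DataFrame instead:

```python
import numpy as np
import pandas as pd

# Illustrative data with a non-contiguous index, like a filtered frame.
people = pd.DataFrame({"Age": [29, 30, 31]}, index=[16, 18, 30])

# len(people) always matches the number of rows, so the assignment
# keeps working if rows are added or removed.
people["Person ID"] = np.arange(len(people))

print(people)
```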

###### A new DataFrame called 'sleep_frame' is created from the dictionary 'data'. The dictionary has two keys, 'Gender' and 'Person ID', and each key maps to a list containing the data for the respective column.

In [116]:
# DataFrames can be constructed in many ways. Another way is from a dictionary of equal-length lists

data = {'Gender':['Male','Female'],
'Person ID':[2,370]}
sleep_frame = DataFrame(data)
#Show
sleep_frame

Unnamed: 0,Gender,Person ID
0,Male,2
1,Female,370


###### The resulting DataFrame has two columns, 'Gender' and 'Person ID'. The 'Gender' column contains the values 'Male' and 'Female', while the 'Person ID' column contains the values 2 and 370.

###### In pandas, a column can be deleted from a DataFrame with the del keyword. The deletion modifies the DataFrame in place.

In [117]:
#We can also delete columns
del frame['Blood Pressure']
frame


Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Heart Rate,Daily Steps,Sleep Disorder
16,16,16,16,16,0,Female,29,Software Engineer,6.5,5,40,7,Normal Weight,80,4000,
18,18,18,18,18,1,Female,29,Software Engineer,6.5,5,40,7,Normal Weight,80,4000,
30,30,30,30,30,2,Female,30,Software Engineer,6.4,5,35,7,Normal Weight,78,4100,
31,31,31,31,31,3,Female,30,Software Engineer,6.4,5,35,7,Normal Weight,78,4100,
32,32,32,32,32,4,Female,31,Software Engineer,7.9,8,75,4,Normal Weight,69,6800,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307,307,307,307,307,111,Female,52,Software Engineer,6.5,7,45,7,Overweight,72,6000,
308,308,308,308,308,112,Female,52,Software Engineer,6.6,7,45,7,Overweight,72,6000,
309,309,309,309,309,113,Female,52,Software Engineer,6.6,7,45,7,Overweight,72,6000,
310,310,310,310,310,114,Female,52,Software Engineer,6.6,7,45,7,Overweight,72,6000,
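
An alternative to del is the drop() method, which returns a new DataFrame and leaves the original unchanged. A minimal sketch on made-up rows (`people` is hypothetical):

```python
import pandas as pd

# Illustrative data, not taken from the dataset.
people = pd.DataFrame({
    "Age": [29, 30],
    "Blood Pressure": ["132/87", "130/86"],
    "Heart Rate": [80, 78],
})

# drop(columns=...) returns a new DataFrame without the listed columns.
without_bp = people.drop(columns=["Blood Pressure"])

print(list(without_bp.columns))  # ['Age', 'Heart Rate']
print(list(people.columns))      # ['Age', 'Blood Pressure', 'Heart Rate']
```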


###### The 'Blood Pressure' column of the DataFrame 'df' (the rows with an age of 59 selected earlier) is overwritten with a list of sequential integers, one per row. The commented-out line shows an alternative that would fill a 'Blood Pressure' column on 'frame' with the first two characters of the 'Gender' column.

In [118]:
# Adding new columns to a DataFrame
# frame['Blood Pressure'] = frame['Gender'].str[0:2]
lst=list(range(16))
df['Blood Pressure']=lst
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Blood Pressure']=lst


Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
358,358,358,358,358,358,Female,59,Software Engineer,8.0,9,75,3,Overweight,0,68,7000,
359,359,359,359,359,359,Female,59,Software Engineer,8.1,9,75,3,Overweight,1,68,7000,
360,360,360,360,360,360,Female,59,Software Engineer,8.2,9,75,3,Overweight,2,68,7000,
361,361,361,361,361,361,Female,59,Software Engineer,8.2,9,75,3,Overweight,3,68,7000,
362,362,362,362,362,362,Female,59,Software Engineer,8.2,9,75,3,Overweight,4,68,7000,
363,363,363,363,363,363,Female,59,Software Engineer,8.2,9,75,3,Overweight,5,68,7000,
364,364,364,364,364,364,Female,59,Software Engineer,8.0,9,75,3,Overweight,6,68,7000,
365,365,365,365,365,365,Female,59,Software Engineer,8.0,9,75,3,Overweight,7,68,7000,
366,366,366,366,366,366,Female,59,Software Engineer,8.1,9,75,3,Overweight,8,68,7000,
367,367,367,367,367,367,Female,59,Software Engineer,8.0,9,75,3,Overweight,9,68,7000,


###### The 'Blood Pressure' column of 'df' now holds the values from the list 'lst', and the resulting DataFrame is displayed with the updated column.

## 7. Joining/Combining DataFrames and Groupby

##### Concatenate DataFrames

In [119]:
# Load a second DataFrame:
frame2=pd.read_csv("2015.csv")
frame2

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.43630,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.03880,1.45900,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176
...,...,...,...,...,...,...,...,...,...,...,...,...
153,Rwanda,Sub-Saharan Africa,154,3.465,0.03464,0.22208,0.77370,0.42864,0.59201,0.55191,0.22628,0.67042
154,Benin,Sub-Saharan Africa,155,3.340,0.03656,0.28665,0.35386,0.31910,0.48450,0.08010,0.18260,1.63328
155,Syria,Middle East and Northern Africa,156,3.006,0.05015,0.66320,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858
156,Burundi,Sub-Saharan Africa,157,2.905,0.08658,0.01530,0.41587,0.22396,0.11850,0.10062,0.19727,1.83302


###### The two DataFrames 'frame1' and 'frame2' are concatenated along the rows with the pd.concat() function, resulting in a new DataFrame called 'merge_frame'. This operation stacks the rows of 'frame2' below the rows of 'frame1'. Because the two DataFrames share almost no columns, the columns missing from each one are filled with NaN.

In [120]:
frames = [frame1,frame2]
merge_frame= pd.concat(frames)
merge_frame

Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,...,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,0.0,0.0,0.0,0.0,0.0,0.0,Male,27.0,Software Engineer,6.1,...,,,,,,,,,,
1,1.0,1.0,1.0,1.0,1.0,1.0,Male,28.0,Software Engineer,6.2,...,,,,,,,,,,
2,2.0,2.0,2.0,2.0,2.0,2.0,Male,28.0,Software Engineer,6.2,...,,,,,,,,,,
3,3.0,3.0,3.0,3.0,3.0,3.0,Male,28.0,Software Engineer,5.9,...,,,,,,,,,,
4,4.0,4.0,4.0,4.0,4.0,4.0,Male,28.0,Software Engineer,5.9,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
153,,,,,,,,,,,...,154.0,3.465,0.03464,0.22208,0.77370,0.42864,0.59201,0.55191,0.22628,0.67042
154,,,,,,,,,,,...,155.0,3.340,0.03656,0.28665,0.35386,0.31910,0.48450,0.08010,0.18260,1.63328
155,,,,,,,,,,,...,156.0,3.006,0.05015,0.66320,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858
156,,,,,,,,,,,...,157.0,2.905,0.08658,0.01530,0.41587,0.22396,0.11850,0.10062,0.19727,1.83302


###### The output contains the rows of both DataFrames.
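
Concatenation stacks rows, whereas a merge matches rows on a shared key column. The sketch below contrasts the two on made-up DataFrames (`left` and `right` are hypothetical and share a 'Person ID' column, which the two files in this notebook do not).

```python
import pandas as pd

# Illustrative data, not taken from the datasets.
left = pd.DataFrame({"Person ID": [1, 2], "Age": [27, 28]})
right = pd.DataFrame({"Person ID": [2, 3], "Daily Steps": [10000, 3000]})

# concat stacks the rows; columns absent from one frame become NaN.
# ignore_index=True renumbers the rows 0..n-1 instead of repeating labels.
stacked = pd.concat([left, right], ignore_index=True)

# merge joins rows that have the same 'Person ID' in both frames.
merged = pd.merge(left, right, on="Person ID", how="inner")

print(stacked)
print(merged)
```

An inner merge keeps only keys present on both sides; `how="left"`, `"right"`, or `"outer"` keep unmatched rows as well.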

## 8. Aggregating functions

###### sum() calculates the sum of the values in the 'Sleep Duration' column of the DataFrame 'frame'.

In [121]:
frame['Sleep Duration'].sum()

768.4000000000001

###### 'frame' is a DataFrame with columns of different data types. Calling frame.nunique() returns a Series whose index holds the column names and whose values give the number of unique values in each column.

In [122]:
frame.nunique()

Unnamed: 0.3               116
Unnamed: 0.2               116
Unnamed: 0.1               116
Unnamed: 0                 116
Person ID                  116
Gender                       1
Age                         20
Occupation                   1
Sleep Duration              16
Quality of Sleep             5
Physical Activity Level     11
Stress Level                 5
BMI Category                 4
Heart Rate                  13
Daily Steps                 13
Sleep Disorder               0
dtype: int64

In [123]:
frame1.describe()

Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Person ID,Age,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,Heart Rate,Daily Steps,Sleep Disorder
count,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,0.0
mean,186.5,186.5,186.5,186.5,186.5,186.5,42.184492,7.132086,7.312834,59.171123,5.385027,70.165775,6816.84492,
std,108.108742,108.108742,108.108742,108.108742,108.108742,108.108742,8.673133,0.795657,1.196956,20.830804,1.774526,4.135676,1617.915679,
min,0.0,0.0,0.0,0.0,0.0,0.0,27.0,5.8,4.0,30.0,3.0,65.0,3000.0,
25%,93.25,93.25,93.25,93.25,93.25,93.25,35.25,6.4,6.0,45.0,4.0,68.0,5600.0,
50%,186.5,186.5,186.5,186.5,186.5,186.5,43.0,7.2,7.0,60.0,5.0,70.0,7000.0,
75%,279.75,279.75,279.75,279.75,279.75,279.75,50.0,7.8,8.0,75.0,7.0,72.0,8000.0,
max,373.0,373.0,373.0,373.0,373.0,373.0,59.0,8.5,9.0,90.0,8.0,86.0,10000.0,


###### mean() calculates the mean (average) of the values in the 'Stress Level' column of the DataFrame 'frame1'.

In [124]:
frame1['Stress Level'].mean()

5.385026737967914

###### std() calculates the standard deviation of the values in the 'Stress Level' column of the DataFrame 'frame1'.

In [125]:
frame1['Stress Level'].std()

1.774526444198519
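
The mean and standard deviation are the two inputs needed to estimate the probability of a score under a normal distribution, as mentioned in the objectives. The sketch below is only an illustration: it assumes the stress scores are approximately normal, which is a strong assumption for an integer scale, and `normal_cdf` is a helper written for this example using the standard library.

```python
import math

def normal_cdf(x, mean, std):
    """P(X <= x) for a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

stress_mean = 5.385026737967914  # frame1['Stress Level'].mean()
stress_std = 1.774526444198519   # frame1['Stress Level'].std()

# Probability of a stress level of at most 7 (assuming normality).
p_at_most_7 = normal_cdf(7, stress_mean, stress_std)

# Probability of falling within one standard deviation of the mean.
p_within_one_std = (
    normal_cdf(stress_mean + stress_std, stress_mean, stress_std)
    - normal_cdf(stress_mean - stress_std, stress_mean, stress_std)
)

print(round(p_at_most_7, 3))
print(round(p_within_one_std, 3))
```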

###### max() retrieves the maximum value from the 'Age' column in the DataFrame 'frame1'. It calculates and returns the highest age value present in that particular column.

In [126]:
frame1['Age'].max()

59

###### min() retrieves the minimum value from the 'Age' column in the DataFrame 'frame1'. It calculates and returns the lowest age value present in that particular column.

In [127]:
frame1['Age'].min()

27
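
The aggregations above operate on a whole column. The same functions can be applied per group with groupby(). A minimal sketch on made-up rows (`people` is hypothetical) computes the mean stress level and the age range for each gender:

```python
import pandas as pd

# Illustrative data, not taken from the dataset.
people = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Age": [27, 29, 59, 51],
    "Stress Level": [6, 8, 3, 5],
})

# One aggregated value per group.
mean_stress = people.groupby("Gender")["Stress Level"].mean()

# Several aggregations at once with agg().
age_range = people.groupby("Gender")["Age"].agg(["min", "max"])

print(mean_stress)
print(age_range)
```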