<a href="https://colab.research.google.com/github/ella13162/DataScience/blob/main/pandas_df_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# This first cell is only needed if you use Google Colab
# In Jupyter, you can delete it
import io
from google.colab import files
uploaded = files.upload()

Saving MentalHQ2020withHeader.csv to MentalHQ2020withHeader.csv


In [None]:
# Import the Pandas library and give it a nickname of 'pd'
# Pandas is a data manipulation and analysis library
import pandas as pd

In [None]:
# Read the CSV file and save it in a variable

# For Google Colab (uploaded maps each file name to the file's raw bytes)
df = pd.read_csv(io.BytesIO(uploaded['MentalHQ2020withHeader.csv']))
# For Jupyter
# df = pd.read_csv('MentalHQ2020withHeader.csv')
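
As a rough illustration of why `io.BytesIO` appears in the Colab branch: `files.upload()` returns a dictionary that maps each file name to its raw bytes, and `io.BytesIO` wraps those bytes in a file-like object that `pd.read_csv` can read. The sketch below imitates that dictionary with a tiny made-up CSV (the name `fake_uploaded`, the file name `example.csv` and its contents are hypothetical), so it also runs outside Colab.

```python
import io
import pandas as pd

# Hypothetical stand-in for the dict returned by files.upload():
# file name -> raw bytes of the uploaded file
fake_uploaded = {"example.csv": b"dob,gender\n31/01/1992,Female\n,Male\n"}

# Wrap the bytes in a file-like object so read_csv can parse them
example_df = pd.read_csv(io.BytesIO(fake_uploaded["example.csv"]))
print(example_df)
```

The empty `dob` field in the second data row is read as a null value (`NaN`), which is the kind of missing data handled later in this notebook.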

### **In Pandas, instead of a typical Python list, we use DataFrames**

The Pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields.

DataFrames are similar to SQL tables or Excel spreadsheets. You can think of them as tables of data that also carry metadata (descriptive information) about your dataset, such as column names, row labels and data types.
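
To make the "table plus labels" idea concrete, here is a minimal sketch that builds a small DataFrame by hand. The values are made up for illustration, and `toy_df` is a hypothetical name that is not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up values, only to show the structure of a DataFrame
toy_df = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "averagebreakperday": [2.0, 1.0, 3.0],
})

print(toy_df.shape)    # (number of rows, number of columns)
print(toy_df.columns)  # column labels
print(toy_df.index)    # row labels (0, 1, 2 by default)
print(toy_df.dtypes)   # data type of each column
```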

In [None]:
df # a bare expression on the last line of a cell is displayed, similar to print(df)

In [None]:
df.head(10) # returns the first n rows, as specified within the ()
            # if no number is given, it will return the first 5 rows
df.tail() # returns the last n rows, as specified within the () (5 by default)
          # note: a cell only displays its last expression, so only df.tail() is shown here

In [None]:
df.info() # returns metadata for your DataFrame

In [None]:
df['gender'] # returns the 'gender' column as a Series
             # (not displayed here, because it is not the last expression in the cell)

df[['dob','gender']].loc[50:60] # returns the columns 'dob' and 'gender'
                                # the .loc indexer selects a range of rows by index label

Unnamed: 0,dob,gender
50,31/01/1992,Female
51,11/07/2003,Female
52,09/10/1993,Female
53,,Female
54,04/03/1992,Female
55,22/02/1996,Female
56,16/01/1994,Female
57,28/10/1990,Female
58,03/01/1982,Female
59,04/03/1992,Female
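
One detail of `.loc` that is easy to miss: a label slice includes both endpoints, whereas the position-based `.iloc` follows the usual Python convention of excluding the stop value. The sketch below shows the difference on a small made-up DataFrame (`toy_df` and its values are hypothetical).

```python
import pandas as pd

toy_df = pd.DataFrame({"value": list(range(100, 110))})  # index labels 0..9

by_label = toy_df.loc[2:5]      # label-based: rows 2, 3, 4 and 5
by_position = toy_df.iloc[2:5]  # position-based: rows 2, 3 and 4

print(len(by_label), len(by_position))
```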


In [None]:
print(df.describe())  # returns statistical information for the whole DataFrame
                      # these measures are calculated only for numerical data

# Print each statistical measure separately
# numeric_only=True restricts the calculation to the numeric columns;
# recent pandas versions raise an error for text columns without it
print('Mean values:\n',df.mean(numeric_only=True),'\n')
print('Std values:\n',df.std(numeric_only=True),'\n')
print('Min values:\n',df.min(),'\n')
print('Max values:\n',df.max(),'\n')
print('First quantile values:\n',df.quantile(0.25, numeric_only=True),'\n')
print('Third quantile values:\n',df.quantile(0.75, numeric_only=True),'\n')

Mean values:
 birthpositionfather    3.514507
birthpositionmother    2.777563
averagebreakperday     2.129032
dtype: float64 

Std values:
 birthpositionfather    2.621150
birthpositionmother    1.996809
averagebreakperday     2.226444
dtype: float64 

Min values:
 Timestamp              2020/04/12 10:55:12 PM GMT+1
birthpositionfather                             1.0
birthpositionmother                             1.0
averagebreakperday                              0.0
dtype: object 

Max values:
 Timestamp              2020/09/24 3:47:59 PM GMT+1
birthpositionfather                           10.0
birthpositionmother                           10.0
averagebreakperday                            10.0
dtype: object 

First quantile values:
 birthpositionfather    1.0
birthpositionmother    1.0
averagebreakperday     1.0
Name: 0.25, dtype: float64 

Third quantile values:
 birthpositionfather    5.0
birthpositionmother    4.0
averagebreakperday     3.0
Name: 0.75, dtype: float64 
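
The effect of `numeric_only=True` can be seen on a small mixed-type DataFrame. This is only a sketch with made-up values (`toy_df` is a hypothetical name): the text column is skipped and the statistics are computed for the numeric column alone.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "averagebreakperday": [0.0, 1.0, 3.0, 10.0],
})

# numeric_only=True skips the text column instead of trying to average it
means = toy_df.mean(numeric_only=True)
q1 = toy_df.quantile(0.25, numeric_only=True)
q3 = toy_df.quantile(0.75, numeric_only=True)
print(means, q1, q3, sep="\n\n")
```

By default, `quantile` interpolates linearly between neighbouring values, so a quartile does not have to be a value that actually occurs in the column.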



  


In [None]:
# This is a function to find outliers using the interquartile range (IQR) rule

def find_outliers(data):
  """Return the values of a numeric column (Series) that lie outside the IQR fences."""
  q1 = data.quantile(0.25)
  q3 = data.quantile(0.75)
  iqr = q3 - q1

  # we have an outlier when
  # value < q1 - 1.5*iqr
  # OR
  # value > q3 + 1.5*iqr
  outlier = data[(data < (q1 - 1.5*iqr)) | (data > (q3 + 1.5*iqr))]

  return outlier

In [None]:
# calling the function for a specific column of the DataFrame
outlier = find_outliers(df['birthpositionmother'])
outlier_len = len(outlier)
print(outlier, outlier_len)

24     10.0
39     10.0
57      9.0
62     10.0
154    10.0
250    10.0
283     9.0
307    10.0
334     9.0
383    10.0
535    10.0
Name: birthpositionmother, dtype: float64
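
The arithmetic behind this result: with the quartiles reported earlier for `birthpositionmother` (1.0 and 4.0), the IQR is 3.0, so the fences are 1.0 - 4.5 = -3.5 and 4.0 + 4.5 = 8.5, which is why only the values 9 and 10 are flagged. The same steps are sketched below on a small made-up Series (`toy` and its values are hypothetical).

```python
import pandas as pd

toy = pd.Series([1, 1, 2, 3, 4, 4, 5, 10], name="toy_values")

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Fences: anything below `lower` or above `upper` counts as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = toy[(toy < lower) | (toy > upper)]
print(lower, upper)
print(outliers)
```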

In [None]:
# Finding null values and replacing them

print(df.isnull()) # returns True where a data point is null and False otherwise
df.isnull().sum() # summing the True values gives the count of null values per column

Timestamp                        0
dob                             42
gender                          16
countryofresidence              24
birthpositionfather             25
birthpositionmother             25
ethnicity                       16
highesteducation                15
occupation                      28
averagedayworkinghours          46
averagebreakperday              46
averageyearholiday              24
maritalstatus                   13
homeresidentialstatus           13
hometype                        16
religion                        14
childhoodliving                 12
currentliving                   12
psychoactivesubstance           14
ageofpsychoactivesubstance      31
mentalhealthhistory             11
yearofdiagnosis                474
mentalsymptoms                 461
mentalhealthtype               461
mentalhealthtreatment          463
medications                    494
everadmittedformentalhealth    461
 admissionage                  502
treatmentcarepathway
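
Raw counts are easier to judge next to the share of rows they represent. As a small sketch on made-up data (`toy_df` is hypothetical), `isnull().mean()` gives the fraction of missing values per column, because `True` counts as 1 and `False` as 0.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "dob": ["31/01/1992", None, "09/10/1993", None],
    "averagebreakperday": [2.0, np.nan, 1.0, 3.0],
})

null_counts = toy_df.isnull().sum()        # missing values per column
null_share = toy_df.isnull().mean() * 100  # percentage of missing values per column
print(null_counts)
print(null_share)
```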

In [None]:
# We can delete entire columns / rows using the .drop() method on our DataFrame
# We need to specify the exact name of the column (note the leading space in ' admissionage')
# axis = 1 --> column , axis = 0 --> row (if you need to delete a row)
# inplace = True --> If specified, the column will be deleted from the original DataFrame
#                    If you don't specify it, then a NEW DataFrame will be returned and you will need to assign it
#                    to a variable in order to save it

updated_df = df.drop(' admissionage', axis=1) # drop a column and store the result in a new DataFrame
updated_df.info()

# delete the column with the inplace parameter
# (re-running this cell raises a KeyError, because the column is already gone from df)
df.drop(' admissionage', axis=1, inplace=True)
df.info()
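
A few variations on `.drop()` may be useful here. This is a sketch on a made-up DataFrame (`toy_df` and its column names are hypothetical): `columns=` is an alternative to `axis=1`, `errors="ignore"` makes the call safe to repeat, and stripping whitespace from the column names avoids awkward names such as `' admissionage'`.

```python
import pandas as pd

toy_df = pd.DataFrame({"a": [1, 2], " b": [3, 4], "c": [5, 6]})

# Without inplace, drop() returns a new DataFrame; toy_df is unchanged
without_b = toy_df.drop(columns=" b")

# errors="ignore" makes the call safe to repeat if the column is already gone
without_b_again = without_b.drop(columns=" b", errors="ignore")

# Stripping whitespace from the column names avoids names like " b"
toy_df.columns = toy_df.columns.str.strip()
print(list(without_b.columns), list(toy_df.columns))
```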

In [None]:
# We can fill in null values of a column by using the .fillna() method
# In the method we need to specify HOW the values will be filled in
# This can be a fixed number, a placeholder value (e.g. a random value within a range), or some statistical measure (like the mean value of the column)

print('Mean: ', df['birthpositionfather'].mean())
# Fill in the null values of the 'birthpositionfather' column with the mean value of 'birthpositionfather'
# .fillna() returns a new Series, so df itself is not changed unless the result is assigned back
df['birthpositionfather'].fillna(df['birthpositionfather'].mean()).loc[0:19]

# Fill in the null values of 'birthpositionfather' with the number 3.0
# df['birthpositionfather'].fillna(3.0).loc[0:19]

Mean:  3.514506769825919


0     3.0
1     2.0
2     1.0
3     4.0
4     1.0
5     3.0
6     2.0
7     1.0
8     4.0
9     1.0
10    5.0
11    1.0
12    4.0
13    1.0
14    1.0
15    3.0
16    8.0
17    3.0
18    2.0
19    1.0
Name: birthpositionfather, dtype: float64
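
Because `.fillna()` returns a new Series, the filled values only persist if they are assigned back to the column. A minimal sketch on made-up data (`toy_df` and the column name `birthposition` are hypothetical):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"birthposition": [1.0, np.nan, 3.0, np.nan, 8.0]})

# mean() skips null values: (1 + 3 + 8) / 3 = 4.0
col_mean = toy_df["birthposition"].mean()

# Assign the result back so the DataFrame keeps the filled values
toy_df["birthposition"] = toy_df["birthposition"].fillna(col_mean)
print(toy_df)
```

For a skewed column, the median (`.median()`) is a common alternative to the mean, since it is less affected by outliers.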