## `Pandas`
- used for cleaning, exploring, manipulating, and analyzing data.

#### &emsp;`Why use it?`
- &emsp;&emsp; can analyze large amounts of data and derive statistical inferences.
- &emsp;&emsp; transforms messy data into a readable, relevant form.

#### &emsp; `Package installation`
```shell
# Run in a command prompt after activating the environment.
pip install pandas
```

In [1]:
# import in file to use its features.
import pandas as pd

# check version
print(pd.__version__)

1.5.3


#### `Datatypes`
- Series
- DataFrame

### `Series`
- like a column in a table.
- can hold data of any type.
- is a one-dimensional labeled array.

#### `Create series from list`

In [1]:
import pandas as pd
arr = [1, 2, 3]
print(type(arr))
series_1 = pd.Series(arr)

print(type(series_1))
print(series_1)

<class 'list'>
<class 'pandas.core.series.Series'>
0    1
1    2
2    3
dtype: int64


#### `Labels`
- If no labels are provided, the values are indexed by position (0, 1, 2, ...).
- Values can be given explicit labels through the `index` argument.
- Values can be accessed by position or by label (if provided).

#### `Create series from list along with labels.`

In [4]:
arr = [1, 2, 3]
labels = ['first', 'second', 'third']
series_2 = pd.Series(arr, index=labels)

print(type(series_2))
print(series_2)

<class 'pandas.core.series.Series'>
first     1
second    2
third     3
dtype: int64


#### `Access values`

In [5]:
print(f"first value of series_1: {series_1[0]}")

print(f"first value of series_2: {series_2['first']}")

first value of series_1: 1
first value of series_2: 1
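
Plain `[]` indexing on a Series can be read as a label or as a position depending on the index, so it can help to state the intent explicitly. A minimal sketch, rebuilding the values of `series_2` so it runs on its own, using `.loc` for labels and `.iloc` for positions:

```python
import pandas as pd

# Same data as series_2 above, rebuilt here so the snippet is self-contained.
series_2 = pd.Series([1, 2, 3], index=['first', 'second', 'third'])

by_label = series_2.loc['second']   # label-based lookup
by_position = series_2.iloc[1]      # position-based lookup (0-indexed)

print(by_label, by_position)
```

Both lookups return the same element here, because `'second'` is the label at position 1.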


#### `Create series from dictionary`

In [6]:
places_area = {'Kathmandu':1234, 'Pokhara':2345, 'Dharan':3456}

series_3 = pd.Series(places_area)
print(series_3)

# keys in the dictionary become labels in the series

Kathmandu    1234
Pokhara      2345
Dharan       3456
dtype: int64


#### `Create series from subset of dictionary`

In [7]:
places_area = {'Kathmandu':1234, 'Pokhara':2345, 'Dharan':3456}

series_4 = pd.Series(places_area, index=['Kathmandu', 'Pokhara'])
print(series_4)

Kathmandu    1234
Pokhara      2345
dtype: int64
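
A related case worth knowing is what happens when `index` names a label that is not a key of the dictionary. The sketch below uses `'Biratnagar'` as a hypothetical missing key; pandas fills the gap with `NaN`, which also turns the values into floats:

```python
import pandas as pd

places_area = {'Kathmandu': 1234, 'Pokhara': 2345, 'Dharan': 3456}

# 'Biratnagar' is a hypothetical label that is not a key of the dictionary.
series_5 = pd.Series(places_area, index=['Kathmandu', 'Biratnagar'])

print(series_5)
print(series_5.isna())
```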


### `DataFrame`
- like a table.
- 2-D data structure having rows and columns.

#### `Create a dataframe from dictionary`

In [8]:
subject_marks = {
    'english' : [50, 60, 70, 80 , 90],
    'math' : [51, 53, 55, 52, 50],
    'science' : [80, 81, 82, 83, 84],
    'computer' : [90, 91, 92, 93, 94]
}

df = pd.DataFrame(subject_marks)
print(df)

   english  math  science  computer
0       50    51       80        90
1       60    53       81        91
2       70    55       82        92
3       80    52       83        93
4       90    50       84        94


#### `Locate rows in DataFrame`
- use the `loc` attribute to select rows by their index label.

In [9]:
# a single label returns the row as a Series
row = df.loc[0]
print(type(row))
print(row)

<class 'pandas.core.series.Series'>
english     50
math        51
science     80
computer    90
Name: 0, dtype: int64


In [10]:
# a list of labels returns a DataFrame
row = df.loc[[0]]
print(type(row))
print(row)

<class 'pandas.core.frame.DataFrame'>
   english  math  science  computer
0       50    51       80        90
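
`loc` also accepts slices and a column name, and it differs from the position-based `iloc` in how it treats the end of a slice. A small sketch on a cut-down copy of the marks table (two columns only, to keep it short):

```python
import pandas as pd

df = pd.DataFrame({'english': [50, 60, 70, 80, 90],
                   'math': [51, 53, 55, 52, 50]})

by_label = df.loc[0:2]       # label slice: the end label 2 is included (3 rows)
by_position = df.iloc[0:2]   # position slice: the end position 2 is excluded (2 rows)
cell = df.loc[1, 'math']     # a single cell by row label and column name

print(len(by_label), len(by_position), cell)
```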


#### `Named Indexes`

In [11]:
# Create a dataframe from dictionary
subject_marks = {
    'english' : [50, 60, 70, 80 , 90],
    'math' : [51, 53, 55, 52, 50],
    'science' : [80, 81, 82, 83, 84],
    'computer' : [90, 91, 92, 93, 94]
}

labels = ['student1','student2','student3','student4','student5']

df = pd.DataFrame(subject_marks, index=labels)
print(df)

          english  math  science  computer
student1       50    51       80        90
student2       60    53       81        91
student3       70    55       82        92
student4       80    52       83        93
student5       90    50       84        94


#### `Access rows using labels`

In [12]:
row = df.loc['student1']
print(type(row))
print(row)

<class 'pandas.core.series.Series'>
english     50
math        51
science     80
computer    90
Name: student1, dtype: int64


In [13]:
row = df.loc[['student1']]
print(type(row))
print(row)

<class 'pandas.core.frame.DataFrame'>
          english  math  science  computer
student1       50    51       80        90


#### `Access columns using column names`

In [14]:
# List out the names of columns in dataframe
print('Before adding column(s)\n', df.columns)

Before adding column(s)
 Index(['english', 'math', 'science', 'computer'], dtype='object')


In [15]:
# Adding column for subject 'social'
df['social'] = [1, 2, 3, 4, 5]
print('After adding column(s)\n', df.columns)

After adding column(s)
 Index(['english', 'math', 'science', 'computer', 'social'], dtype='object')

In [16]:
df.head()

| | english | math | science | computer | social |
|---|---|---|---|---|---|
| student1 | 50 | 51 | 80 | 90 | 1 |
| student2 | 60 | 53 | 81 | 91 | 2 |
| student3 | 70 | 55 | 82 | 92 | 3 |
| student4 | 80 | 52 | 83 | 93 | 4 |
| student5 | 90 | 50 | 84 | 94 | 5 |


#### `Find n-largest and n-smallest values in columns`

In [18]:
# find the n rows with the largest values in a particular column
print(df.nlargest(3, ['english']))

          english  math  science  computer  social
student5       90    50       84        94       5
student4       80    52       83        93       4
student3       70    55       82        92       3


In [19]:
# find the n rows with the smallest values in a particular column
print(df.nsmallest(3, ['english']))

          english  math  science  computer  social
student1       50    51       80        90       1
student2       60    53       81        91       2
student3       70    55       82        92       3
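
One way to read `nlargest(n, column)` is as a sort in descending order followed by `head(n)`. The sketch below checks that equivalence on a cut-down copy of the marks table; it is an illustration of the idea, not a statement about how pandas implements it:

```python
import pandas as pd

df = pd.DataFrame(
    {'english': [50, 60, 70, 80, 90], 'math': [51, 53, 55, 52, 50]},
    index=['student1', 'student2', 'student3', 'student4', 'student5'],
)

top_math = df.nlargest(2, 'math')
same_by_sort = df.sort_values('math', ascending=False).head(2)

print(top_math)
print(top_math.equals(same_by_sort))
```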


#### `Convert dataframe into numpy array`

In [20]:
import pandas as pd
 
# initialize a dataframe
df = pd.DataFrame(
        [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
        [10, 11, 12]],
        columns=['a', 'b', 'c']
    )
df.head()

| | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |
| 3 | 10 | 11 | 12 |


In [21]:
# convert dataframe to numpy array
arr = df.to_numpy()
 
print('Numpy Array \n', arr)
print('Type of array: ', type(arr))
print('Type of elements: ', arr.dtype)

Numpy Array 
 [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
Type of array:  <class 'numpy.ndarray'>
Type of elements:  int64


In [22]:
# creating series using column of dataframe
series = pd.Series(df['a'].head())
arr = series.to_numpy()

print('Numpy Array \n', arr)
print('Type of array: ', type(arr))
print('Type of elements: ', arr.dtype)

Numpy Array 
 [ 1  4  7 10]
Type of array:  <class 'numpy.ndarray'>
Type of elements:  int64
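
The element type stays `int64` above because every column holds integers. When columns have different types, `to_numpy()` has to pick one type that fits them all. A short sketch with a made-up mixed frame:

```python
import pandas as pd

# Hypothetical frame mixing integers, floats, and strings.
mixed = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})

numeric_arr = mixed[['a', 'b']].to_numpy()  # int + float columns -> float64
all_arr = mixed.to_numpy()                  # numbers + strings -> object

print(numeric_arr.dtype, all_arr.dtype)
```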


#### `Load Data from CSV file`
- use `pd.read_csv()` to load a CSV file into a DataFrame.

In [5]:
import pandas as pd
# Load csv data as DataFrame
df = pd.read_csv('./csv_files/organizations-10000.csv')

In [6]:
# Remove a column by name
df = df.drop(['Index'], axis=1)
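
When a CSV carries its own row-number column, `index_col` can use it as the DataFrame index at load time instead of dropping it afterwards. A minimal sketch, assuming a tiny in-memory CSV (the rows are made up) so it runs without the file on disk:

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """Index,Name,Founded
1,Acme,1990
2,Globex,2005
"""

# read_csv accepts a file-like object as well as a path.
sample = pd.read_csv(io.StringIO(csv_text), index_col='Index')
print(sample)
```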

In [7]:
# statistical summary of the numeric columns
df.describe()

| | Founded | Number of employees |
|---|---|---|
| count | 10000.0 | 10000.0 |
| mean | 1995.7767 | 4961.321 |
| std | 14.991608 | 2911.862096 |
| min | 1970.0 | 1.0 |
| 25% | 1983.0 | 2446.75 |
| 50% | 1996.0 | 4894.0 |
| 75% | 2009.0 | 7530.0 |
| max | 2022.0 | 9999.0 |


In [8]:
# Datatype, count, and non-null information
df.info()           # gives information about each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Organization Id      10000 non-null  object
 1   Name                 10000 non-null  object
 2   Website              10000 non-null  object
 3   Country              10000 non-null  object
 4   Description          10000 non-null  object
 5   Founded              10000 non-null  int64 
 6   Industry             10000 non-null  object
 7   Number of employees  10000 non-null  int64 
dtypes: int64(2), object(6)
memory usage: 625.1+ KB


In [9]:
# first 5 rows
df.head()

| | Organization Id | Name | Website | Country | Description | Founded | Industry | Number of employees |
|---|---|---|---|---|---|---|---|---|
| 0 | 522816eF8fdBE6d | Mckinney PLC | http://soto.com/ | Sri Lanka | Synergized global system engine | 1988 | Dairy | 3930 |
| 1 | 70C7FBD7e6Aa3Ea | Cunningham LLC | http://harding-duffy.com/ | Namibia | Team-oriented fault-tolerant adapter | 2018 | Library | 7871 |
| 2 | 428B397eA2d7290 | Ruiz-Walls | http://www.atkins.biz/ | Iran | Re-contextualized bifurcated moderator | 2003 | Hospital / Health Care | 3095 |
| 3 | 9D234Ae8Cc51C1c | Parrish, Osborne and Clarke | http://salazar.info/ | British Indian Ocean Territory (Chagos Archipe... | Fully-configurable next generation concept | 1989 | Supermarkets | 5422 |
| 4 | 6CDCcdE3D0b7b44 | Diaz, Robles and Haley | https://www.brooks-scott.net/ | Botswana | Inverse intangible methodology | 2013 | Nanotechnology | 3135 |


In [10]:
# last 5 rows
df.tail()

| | Organization Id | Name | Website | Country | Description | Founded | Industry | Number of employees |
|---|---|---|---|---|---|---|---|---|
| 9995 | 2EE82AD1Cd045cd | Neal, Day and Wang | https://carson.net/ | San Marino | Team-oriented multimedia core | 2013 | Import / Export | 6123 |
| 9996 | 06f1568A2CaF04a | Barrett, Rojas and Adkins | https://douglas-garza.com/ | Turkmenistan | Cross-group dedicated methodology | 2018 | Human Resources / HR | 9043 |
| 9997 | B4B92A44e0331Bc | Franklin-Ayala | http://www.torres.org/ | Yemen | Polarized exuding orchestration | 1983 | Financial Services | 8951 |
| 9998 | 01D2539e270CEbd | Wolfe-Mckee | http://www.parks.com/ | Togo | Balanced value-added ability | 1975 | Environmental Services | 2505 |
| 9999 | 0D3b7DcFA21d92d | Beck LLC | https://wagner-glover.com/ | Georgia | Decentralized context-sensitive service-desk | 2007 | Wireless | 4552 |


In [11]:
# Remove rows with empty cells, returning a new DataFrame
new_df = df.dropna()        # removes rows that contain NaN values
len(new_df)

10000

In [None]:
# Remove rows with empty cells, modifying the original DataFrame in place
df.dropna(inplace=True)

In [None]:
# Removing rows with empty cell of particular column
df.dropna(subset=['Website'], inplace=True)

In [None]:
# replace all the NaN values in dataframe
df.fillna(0, inplace=True)

In [None]:
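# replace the NaN values of a particular column
# (assign the result back; inplace=True on a selected column may not update df in newer pandas)
df['Country'] = df['Country'].fillna(0)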

In [None]:
# Replace NaN values with a statistic of the column: mean, median, or mode
num_emp_mean = df['Number of employees'].mean()
df['Number of employees'] = df['Number of employees'].fillna(num_emp_mean)
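
The organizations file has no missing values (`len(new_df)` is still 10000), so the calls above leave it unchanged. To see their effect, here is a small sketch on a made-up frame with gaps; the column names and values are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one incomplete row.
toy = pd.DataFrame({
    'Country': ['Nepal', None, 'India'],
    'employees': [10.0, np.nan, 30.0],
})

dropped = toy.dropna()  # keeps only the complete rows

# A dict fills each column with its own replacement value.
filled = toy.fillna({'Country': 'Unknown',
                     'employees': toy['employees'].mean()})

print(len(dropped))
print(filled)
```

Neither call modifies `toy`; both return new DataFrames.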

In [None]:
df

#### `Column Selection of Dataframe`

In [None]:
# Listing all the columns in dataframe
df.columns

In [None]:
# Selecting single column as Series
name = df['Name']
print(name, type(name))

In [None]:
name = df.Name
print(name, type(name))

In [None]:
# selecting single column as DataFrame
name = df[['Name']]
print(name, type(name))

#### `Column Removal`

In [None]:
df.columns

In [None]:
# Remove columns by name
df.drop(['Website', 'Description'], axis=1, inplace=True)

In [None]:
df.columns

In [None]:
# Rename the column names with rename() method
df.rename(
    columns = {
        'Organization Id':'org_id',
        'Number of employees':'num_employees'
    },
    inplace = True
)

In [None]:
df.columns

#### `Adding new rows in dataframe`

In [None]:
# Add new row
new_data = {'org_id':'aaaaaa', 'Name':'SomeName', 'Country':'Nepal', 'Founded':2015, 'Industry':'software',
       'num_employees':50}

new_row = pd.DataFrame(new_data, index=[0])
df = pd.concat([new_row, df]).reset_index(drop=True)
df.head()
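
The same pattern can append a row at the end instead of the top. A minimal sketch on a made-up two-row frame, using `ignore_index=True` as an alternative to calling `reset_index(drop=True)` afterwards:

```python
import pandas as pd

# Hypothetical small frame and one new record.
df_small = pd.DataFrame({'Name': ['A Corp', 'B Ltd'],
                         'Country': ['Iran', 'Togo']})
new_row = pd.DataFrame([{'Name': 'SomeName', 'Country': 'Nepal'}])

# ignore_index=True renumbers the rows 0..n-1 in the result.
appended = pd.concat([df_small, new_row], ignore_index=True)
print(appended)
```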

#### `Removal of duplicate rows`

In [None]:
## Run the cell above more than once to create duplicate rows
# remove duplicated rows
dup = df.drop_duplicates().reset_index(drop=True)
dup

In [None]:
## Keep only the first row for each value of a particular column
df.drop_duplicates(subset=['Country'])

In [None]:
# Remove duplicates on specific column(s) and
# keep the last occurrence rather than the first
df.drop_duplicates(subset=['Founded', 'Industry'], keep='last')
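
The `keep` argument decides which of the duplicated rows survives. A small sketch on a made-up three-row frame compares the three options by looking at the row labels that remain:

```python
import pandas as pd

# Hypothetical frame in which 'Nepal' appears twice.
toy = pd.DataFrame({
    'Country': ['Nepal', 'Nepal', 'Togo'],
    'Founded': [2015, 2018, 2015],
})

first = toy.drop_duplicates(subset=['Country'])              # default keep='first'
last = toy.drop_duplicates(subset=['Country'], keep='last')
none = toy.drop_duplicates(subset=['Country'], keep=False)   # drops every duplicated row

print(list(first.index), list(last.index), list(none.index))
```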

#### `Filter dataframe based on condition`

In [None]:
df.head()

In [None]:
condition = df['Country']=='Nepal'
condition

In [None]:
df[condition].head()
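
Boolean conditions can be combined with `&` (and), `|` (or), and `~` (not); each comparison needs its own parentheses. A minimal sketch on a made-up frame, which also uses `isin` to match several values at once:

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    'Country': ['Nepal', 'Togo', 'Nepal', 'Iran'],
    'Founded': [2015, 1990, 1985, 2010],
})

recent_nepal = toy[(toy['Country'] == 'Nepal') & (toy['Founded'] > 2000)]
nepal_or_iran = toy[toy['Country'].isin(['Nepal', 'Iran'])]
not_nepal = toy[~(toy['Country'] == 'Nepal')]

print(recent_nepal)
print(nepal_or_iran)
print(not_nepal)
```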

#### `Retrieve unique values in columns`

In [None]:
# unique countries in column 'Country'
df['Country'].unique()

In [None]:
# number of unique values in column
df['Country'].nunique(dropna=True)

#### `Retrieve index number of column`

In [None]:
index_num = df.columns.get_loc('Country')
print(index_num)

#### `Convert series into dataframe`

In [None]:
# Creating series
quantities =  [60, 20, 40, 90]
labels = ['apple', 'realme', 'oppo', 'xiaomi']
s = pd.Series(quantities, index=labels)
s

In [None]:
# Conversion into dataframe
col_name = "mobile brand"
s_df = s.to_frame(name=col_name)

print(type(s), f"\n{' '*18}to\n", type(s_df))
s_df

In [None]:
# reset the index
s_df = s_df.reset_index()
s_df

In [None]:
# drop the index column
s_df = s_df.drop(['index'], axis=1)
s_df
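
If the labels should be kept as a regular column rather than dropped, the conversion can be done in one step. A possible sketch, where the column names `brand` and `quantity` are chosen here for illustration:

```python
import pandas as pd

s = pd.Series([60, 20, 40, 90], index=['apple', 'realme', 'oppo', 'xiaomi'])

# rename_axis names the index; reset_index(name=...) names the value column.
brands = s.rename_axis('brand').reset_index(name='quantity')
print(brands)
```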