# Module: Pandas Assignments
## Lesson: Pandas
### Assignment 1: DataFrame Creation and Indexing

1. Create a Pandas DataFrame with 4 columns and 6 rows filled with random integers. Set the index to be the first column.
2. Create a Pandas DataFrame with columns 'A', 'B', 'C' and index 'X', 'Y', 'Z'. Fill the DataFrame with random integers and access the element at row 'Y' and column 'B'.

### Assignment 2: DataFrame Operations

1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Add a new column that is the product of the first two columns.
2. Create a Pandas DataFrame with 3 columns and 4 rows filled with random integers. Compute the row-wise and column-wise sum.

### Assignment 3: Data Cleaning

1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Introduce some NaN values. Fill the NaN values with the mean of the respective columns.
2. Create a Pandas DataFrame with 4 columns and 6 rows filled with random integers. Introduce some NaN values. Drop the rows with any NaN values.

### Assignment 4: Data Aggregation

1. Create a Pandas DataFrame with 2 columns: 'Category' and 'Value'. Fill the 'Category' column with random categories ('A', 'B', 'C') and the 'Value' column with random integers. Group the DataFrame by 'Category' and compute the sum and mean of 'Value' for each category.
2. Create a Pandas DataFrame with 3 columns: 'Product', 'Category', and 'Sales'. Fill the DataFrame with random data. Group the DataFrame by 'Category' and compute the total sales for each category.

### Assignment 5: Merging DataFrames

1. Create two Pandas DataFrames with a common column. Merge the DataFrames using the common column.
2. Create two Pandas DataFrames with different columns. Concatenate the DataFrames along the rows and along the columns.

### Assignment 6: Time Series Analysis

1. Create a Pandas DataFrame with a datetime index and one column filled with random integers. Resample the DataFrame to compute the monthly mean of the values.
2. Create a Pandas DataFrame with a datetime index ranging from '2021-01-01' to '2021-12-31' and one column filled with random integers. Compute the rolling mean with a window of 7 days.

### Assignment 7: MultiIndex DataFrame

1. Create a Pandas DataFrame with a MultiIndex (hierarchical index). Perform some basic indexing and slicing operations on the MultiIndex DataFrame.
2. Create a Pandas DataFrame with MultiIndex consisting of 'Category' and 'SubCategory'. Fill the DataFrame with random data and compute the sum of values for each 'Category' and 'SubCategory'.

### Assignment 8: Pivot Tables

1. Create a Pandas DataFrame with columns 'Date', 'Category', and 'Value'. Create a pivot table to compute the sum of 'Value' for each 'Category' by 'Date'.
2. Create a Pandas DataFrame with columns 'Year', 'Quarter', and 'Revenue'. Create a pivot table to compute the mean 'Revenue' for each 'Quarter' by 'Year'.

### Assignment 9: Applying Functions

1. Create a Pandas DataFrame with 3 columns and 5 rows filled with random integers. Apply a function that doubles the values of the DataFrame.
2. Create a Pandas DataFrame with 3 columns and 6 rows filled with random integers. Apply a lambda function to create a new column that is the sum of the existing columns.

### Assignment 10: Working with Text Data

1. Create a Pandas Series with 5 random text strings. Convert all the strings to uppercase.
2. Create a Pandas Series with 5 random text strings. Extract the first three characters of each string.


In [7]:
import pandas as pd
import numpy as np

#Solution 1
df = pd.DataFrame(np.random.randint(1, 100, size=(6, 4)), columns=list('ABCD'))
print(df)

    A   B   C   D
0  66  68  93  19
1  38  90  60  61
2  42  49  94  54
3  29  94  50  51
4  68  15  32  96
5  90  71   5  24


In [3]:
df.set_index('A', inplace=True)

In [6]:
print("Data Frame with new index:")
print(df)

Data Frame with new index:
     B   C   D
A             
66  68  93  19
38  90  60  61
42  49  94  54
29  94  50  51
68  15  32  96
90  71   5  24
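
The cells above draw fresh random integers on every execution, so the printed values differ from run to run. A minimal sketch of a reproducible variant, assuming a seeded NumPy generator is acceptable for the exercise; the seed `42` and the names `rng`, `df_seeded` and `indexed` are illustrative and not part of the original solution:

```python
import numpy as np
import pandas as pd

# Seeded generator: the same seed yields the same integers on every run.
rng = np.random.default_rng(42)  # illustrative seed

# integers(1, 100) draws from 1..99, matching np.random.randint(1, 100).
df_seeded = pd.DataFrame(rng.integers(1, 100, size=(6, 4)), columns=list("ABCD"))

# Without inplace=True, set_index returns a new DataFrame and leaves df_seeded intact.
indexed = df_seeded.set_index("A")
print(indexed)
```

Returning a new object instead of using `inplace=True` keeps the original frame available if the cell is re-run.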


In [8]:
#Solution 2
df = pd.DataFrame(np.random.randint(1, 100, size=(3, 3)), columns=list('ABC'), index=list('XYZ'))
df

    A   B   C
X  92  71   8
Y  12   7  32
Z  75  55  34


In [12]:
print(df.at['Y','B'])

7
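
For comparison, the same cell can be reached through several accessors. The sketch below rebuilds the frame from the values printed above (the name `df_demo` is illustrative) and reads row 'Y', column 'B' in three ways:

```python
import pandas as pd

# Fixed copy of the values shown in the output above.
df_demo = pd.DataFrame(
    [[92, 71, 8], [12, 7, 32], [75, 55, 34]],
    index=list("XYZ"),
    columns=list("ABC"),
)

by_at = df_demo.at["Y", "B"]    # label-based, single scalar
by_loc = df_demo.loc["Y", "B"]  # label-based, also accepts slices and lists
by_iat = df_demo.iat[1, 1]      # position-based, single scalar

print(by_at, by_loc, by_iat)  # 7 7 7
```

`.at` and `.iat` only handle single values, while `.loc` is the general-purpose label accessor.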


In [13]:
#Solution 3
df = pd.DataFrame(np.random.randint(1, 100, size=(5, 3)), columns=list('ABC'))

print(df)

    A   B   C
0  63  82  40
1  13  86  81
2  40  16  65
3  41  64  71
4  68  76  94


In [17]:
df['D'] = df['A'] * df['B']

In [19]:
#DataFrame with new column D
df

    A   B   C     D
0  63  82  40  5166
1  13  86  81  1118
2  40  16  65   640
3  41  64  71  2624
4  68  76  94  5168


In [24]:
#Solution 4
df = pd.DataFrame(np.random.randint(1,100, size=(4,3)), columns=['A','B','C'])
print("Original DataFrame:\n", df)

row_sum = df.sum(axis=1)
column_sum = df.sum(axis=0)

print("Row Sum: \n", row_sum)
print("Column Sum: \n", column_sum)

Original DataFrame:
     A   B   C
0  57  80  79
1  97  68  15
2  77  98  33
3  31  77  20
Row Sum: 
 0    216
1    180
2    208
3    128
dtype: int64
Column Sum: 
 A    262
B    323
C    147
dtype: int64


In [30]:
#Solution 5

df = pd.DataFrame(np.random.randint(1,100, size=(5,3)), columns=['A','B','C'])
print("Original DataFrame: \n",df)

df.iloc[0,1]=np.nan
df.iloc[2,2]=np.nan
df.iloc[4,0]=np.nan
df.iloc[3,2]=np.nan
df

Original DataFrame: 
     A   B   C
0  44  64  49
1  83  47  81
2  29   1  43
3  87  69  11
4  48  47  58


      A     B     C
0  44.0   NaN  49.0
1  83.0  47.0  81.0
2  29.0   1.0   NaN
3  87.0  69.0   NaN
4   NaN  47.0  58.0


In [31]:
df.fillna(df.mean(), inplace=True)
df

       A     B          C
0  44.00  41.0  49.000000
1  83.00  47.0  81.000000
2  29.00   1.0  62.666667
3  87.00  69.0  62.666667
4  60.75  47.0  58.000000
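
The fill step can be checked by hand. This sketch rebuilds the frame with the NaN positions shown above (the names `df_nan`, `col_means` and `filled` are illustrative) and returns a new frame rather than modifying in place:

```python
import numpy as np
import pandas as pd

# Same values and NaN positions as the output above.
df_nan = pd.DataFrame({
    "A": [44, 83, 29, 87, np.nan],
    "B": [np.nan, 47, 1, 69, 47],
    "C": [49, 81, np.nan, np.nan, 58],
})

# mean() skips NaN by default, so each mean uses only the observed values.
col_means = df_nan.mean()

# fillna with a Series aligns on column labels: each column gets its own mean.
filled = df_nan.fillna(col_means)
print(filled)
```

Column 'C' has three observed values (49, 81, 58), so both of its gaps receive 188 / 3.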


In [34]:
#Solution 6
df = pd.DataFrame(np.random.randint(1,101, size=(6,4)),columns=['A','B','C','D'])
print("Original DataFrame: \n",df)


df.iloc[0,1]=np.nan
df.iloc[4,3]=np.nan
df.iloc[5,1]=np.nan
df.iloc[5,3]=np.nan
print("DATAFRAME with NaN values")
print(df)

Original DataFrame: 
     A   B    C   D
0  86  93    9  72
1  48  26  100  92
2  54  23    2  39
3  51  67   74  86
4   7  90   16  49
5  66  94   16  74
DATAFRAME with NaN values
    A     B    C     D
0  86   NaN    9  72.0
1  48  26.0  100  92.0
2  54  23.0    2  39.0
3  51  67.0   74  86.0
4   7  90.0   16   NaN
5  66   NaN   16   NaN


In [35]:
df.dropna(inplace=True)

In [38]:
print("DataFrame after dropping rows with NaN values\n",df)

DataFrame after dropping rows with NaN values
     A     B    C     D
1  48  26.0  100  92.0
2  54  23.0    2  39.0
3  51  67.0   74  86.0
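
`dropna` also accepts parameters that narrow which rows are removed. A small sketch using the values printed above; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Same values and NaN positions as the output above.
df_nan = pd.DataFrame({
    "A": [86, 48, 54, 51, 7, 66],
    "B": [np.nan, 26, 23, 67, 90, np.nan],
    "C": [9, 100, 2, 74, 16, 16],
    "D": [72, 92, 39, 86, np.nan, np.nan],
})

any_dropped = df_nan.dropna()                 # drop rows containing any NaN
subset_dropped = df_nan.dropna(subset=["B"])  # only look at column 'B'
thresh_kept = df_nan.dropna(thresh=3)         # keep rows with at least 3 non-NaN values

print(list(any_dropped.index))
print(list(subset_dropped.index))
print(list(thresh_kept.index))
```

Row 5 is the only row with two missing values, so it is the only one removed by `thresh=3`.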


In [41]:
#Solution 7
df = pd.DataFrame({'Categories':np.random.choice(['A','B','C'], size=10), 'Values': np.random.randint(1,100, size=10)}, columns=['Categories', 'Values'])
df

  Categories  Values
0          C       2
1          B      35
2          A      37
3          A      55
4          C      97
5          C      45
6          C       9
7          A      75
8          B       1
9          C      44


In [45]:
grouped_sum = df.groupby('Categories')['Values'].agg(['sum', 'mean'])

In [46]:
grouped_sum

            sum       mean
Categories                
A           167  55.666667
B            36  18.000000
C           197  39.400000
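
Named aggregation is an alternative that lets the result columns carry descriptive names. The sketch below reuses the values printed above; the output column names `total` and `average` are illustrative choices:

```python
import pandas as pd

# Same rows as the output above.
df_cat = pd.DataFrame({
    "Categories": list("CBAACCCABC"),
    "Values": [2, 35, 37, 55, 97, 45, 9, 75, 1, 44],
})

# Each keyword becomes an output column: name=(source column, aggregation).
summary = df_cat.groupby("Categories").agg(
    total=("Values", "sum"),
    average=("Values", "mean"),
)
print(summary)
```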


In [47]:
#Solution 8
df = pd.DataFrame({'Product':np.random.choice(['A','B','C','D'], size=10), 'Category':np.random.choice(['X','Y','Z'], size=10), 'Sales':np.random.randint(101,201,size=10)}, columns=['Product', 'Category', 'Sales'])

df

  Product Category  Sales
0       B        Z    122
1       A        Y    162
2       C        Z    116
3       C        Y    122
4       A        X    200
5       C        Y    140
6       A        Y    139
7       C        X    151
8       A        Y    159
9       C        Y    160


In [49]:
grouped = df.groupby('Category')['Sales'].sum()
print("Total sales by category:")
print(grouped)

Total sales by category:
Category
X    351
Y    882
Z    238
Name: Sales, dtype: int32


In [58]:
#Solution 9
df1 = pd.DataFrame({'Key':['A','B','C','D'], 'Values1':np.random.randint(1,101, size=4)}, columns=['Key', 'Values1'])
df2 = pd.DataFrame({'Key':['A','B','C','E'], 'Values2':np.random.randint(1,101, size=4)}, columns=['Key', 'Values2'])

print(df1)
print(df2)

  Key  Values1
0   A       48
1   B        8
2   C       30
3   D       45
  Key  Values2
0   A       10
1   B       36
2   C       47
3   E       43


In [59]:
merged = pd.merge(df1, df2, on='Key')

In [60]:
print("Merged DataFrame: \n", merged)

Merged DataFrame: 
   Key  Values1  Values2
0   A       48       10
1   B        8       36
2   C       30       47


In [62]:
#Solution 10
# Create two Pandas DataFrames with different columns
df1 = pd.DataFrame({'A': np.random.randint(1, 100, size=3), 'B': np.random.randint(1, 100, size=3)})
df2 = pd.DataFrame({'C': np.random.randint(1, 100, size=3), 'D': np.random.randint(1, 100, size=3)})

In [66]:
concat_rows = pd.concat([df1, df2], axis=0)
concat_columns = pd.concat([df1,df2], axis=1)

In [67]:
print("Concat Rows:\n", concat_rows)
print("Concat Columns:\n", concat_columns)

Concat Rows:
       A     B     C     D
0  90.0  68.0   NaN   NaN
1  98.0   4.0   NaN   NaN
2  94.0  81.0   NaN   NaN
0   NaN   NaN  50.0  39.0
1   NaN   NaN  58.0  51.0
2   NaN   NaN  76.0  14.0
Concat Columns:
     A   B   C   D
0  90  68  50  39
1  98   4  58  51
2  94  81  76  14
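
In the row-wise result above, the index labels 0 to 2 repeat and the integers have become floats because NaN cannot be stored in an integer column. A sketch with the same values that renumbers the rows via `ignore_index=True` (the name `stacked` is illustrative):

```python
import pandas as pd

# Same values as the output above.
df1 = pd.DataFrame({"A": [90, 98, 94], "B": [68, 4, 81]})
df2 = pd.DataFrame({"C": [50, 58, 76], "D": [39, 51, 14]})

# ignore_index=True discards the original labels and assigns 0..n-1.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked)
print(stacked.dtypes)
```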


In [68]:
#Solution 11
# Create a Pandas DataFrame with a datetime index and one column filled with random integers
date_rng = pd.date_range(start='2024-01-01', end='2024-12-31', freq='D')
df = pd.DataFrame(date_rng, columns=['Date'])
df['Data'] = np.random.randint(1,101, size=len(date_rng))
df.set_index('Date', inplace=True)
print("Original DataFrame:\n", df)

Original DataFrame:
             Data
Date            
2024-01-01    85
2024-01-02    47
2024-01-03    17
2024-01-04    73
2024-01-05    98
...          ...
2024-12-27    75
2024-12-28    48
2024-12-29    22
2024-12-30    79
2024-12-31    19

[366 rows x 1 columns]


In [70]:
monthly_mean = df.resample('ME').mean()  # 'ME' = month end; pandas versions before 2.2 use 'M'
print("Monthly Mean:\n", monthly_mean)

Monthly Mean:
                  Data
Date                 
2024-01-31  61.064516
2024-02-29  63.551724
2024-03-31  54.516129
2024-04-30  56.500000
2024-05-31  47.129032
2024-06-30  50.166667
2024-07-31  48.354839
2024-08-31  48.516129
2024-09-30  52.000000
2024-10-31  55.129032
2024-11-30  54.166667
2024-12-31  43.193548
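
An alternative that avoids the resample frequency alias is to group by calendar month through a period index. This is a sketch with deterministic data so the result can be verified by hand; the names `dates`, `ts` and `monthly` are illustrative:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")
# Values 0, 1, 2, ... make the monthly means easy to check.
ts = pd.DataFrame({"Data": np.arange(len(dates))}, index=dates)

# to_period("M") maps each timestamp to its calendar month.
monthly = ts.groupby(ts.index.to_period("M")).mean()
print(monthly)
```

January holds the values 0 to 30, so its mean is 15; the result is labelled by month (for example `2024-01`) rather than by month-end date.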


In [76]:
#Solution 12
date_rng = pd.date_range(start='2021-01-01', end='2021-12-31', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(1,101, size=len(date_rng))
df.set_index('date', inplace=True)
print("Original DataFrame:\n", df)


# Compute the rolling mean over a 7-row window (7 days here, because the index is daily)
rolling_mean = df.rolling(window=7).mean()
print("Rolling Mean:\n", rolling_mean)

Original DataFrame:
             data
date            
2021-01-01    21
2021-01-02    39
2021-01-03    70
2021-01-04    21
2021-01-05    37
...          ...
2021-12-27    43
2021-12-28    72
2021-12-29     9
2021-12-30    64
2021-12-31    92

[365 rows x 1 columns]
Rolling Mean:
                  data
date                 
2021-01-01        NaN
2021-01-02        NaN
2021-01-03        NaN
2021-01-04        NaN
2021-01-05        NaN
...               ...
2021-12-27  62.428571
2021-12-28  60.285714
2021-12-29  56.714286
2021-12-30  55.857143
2021-12-31  62.714286

[365 rows x 1 columns]
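
An integer window counts rows, while an offset string such as `'7D'` counts elapsed time. On a gap-free daily index the two agree once seven observations are available, but they differ at the start of the series. A sketch with small deterministic data; the names are illustrative:

```python
import pandas as pd

dates = pd.date_range("2021-01-01", periods=10, freq="D")
ts = pd.DataFrame({"data": list(range(1, 11))}, index=dates)

# Row-based window: needs 7 observations, so the first 6 results are NaN.
by_rows = ts.rolling(window=7).mean()

# Time-based window: covers the trailing 7 days and, by default,
# produces a value as soon as one observation is available.
by_time = ts.rolling(window="7D").mean()

print(by_rows.join(by_time, lsuffix="_rows", rsuffix="_time"))
```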


### MultiIndex DataFrame

In [77]:
#Solution 13
arrays=[['A','A','B','B'],['one','two','one','two']]
index=pd.MultiIndex.from_arrays(arrays, names=('Category','SubCategory'))
df =pd.DataFrame(np.random.randint(1,101,size=(4,3)), index=index, columns=['Value1','Value2','Value3'])

print("MultiIndex DataFrame:")
print(df)

MultiIndex DataFrame:
                      Value1  Value2  Value3
Category SubCategory                        
A        one               9      68      96
         two              31      95      71
B        one              89      82      20
         two              92      90      42


In [84]:
df.loc[('A', 'one')]  # a tuple selects the row for Category 'A', SubCategory 'one'

Value1     9
Value2    68
Value3    96
Name: (A, one), dtype: int32

In [87]:
print(df.iloc[0,2])

96


In [88]:
df.shape

(4, 3)

In [89]:
# Basic indexing and slicing operations
print("Indexing at Category 'A':")
print(df.loc['A'])

print("Selecting Category 'B', SubCategory 'two':")
print(df.loc[('B', 'two')])

Indexing at Category 'A':
             Value1  Value2  Value3
SubCategory                        
one               9      68      96
two              31      95      71
Selecting Category 'B', SubCategory 'two':
Value1    92
Value2    90
Value3    42
Name: (B, two), dtype: int32
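
The assignment also asks for slicing. A sketch of three common patterns on the same values, using `pd.IndexSlice` and `xs`; the names `df_mi`, `all_one`, `a_to_b` and `cross` are illustrative:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], ["one", "two", "one", "two"]],
    names=("Category", "SubCategory"),
)
# Same values as the output above.
df_mi = pd.DataFrame(
    [[9, 68, 96], [31, 95, 71], [89, 82, 20], [92, 90, 42]],
    index=index,
    columns=["Value1", "Value2", "Value3"],
)

idx = pd.IndexSlice
# Every Category, but only SubCategory 'one' (both index levels are kept).
all_one = df_mi.loc[idx[:, "one"], :]

# Label slice on the outer level; label slicing expects a sorted index.
a_to_b = df_mi.loc["A":"B"]

# Cross-section: rows where SubCategory == 'two', with that level dropped.
cross = df_mi.xs("two", level="SubCategory")

print(all_one)
print(cross)
```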


In [90]:
#Solution 14
arrays = [['A', 'A', 'B', 'B', 'C', 'C'], ['one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=('Category', 'SubCategory'))
df = pd.DataFrame(np.random.randint(1, 100, size=(6, 3)), index=index, columns=['Value1', 'Value2', 'Value3'])
print("MultiIndex DataFrame:")
print(df)

# Compute the sum of values for each 'Category' and 'SubCategory'
sum_values = df.groupby(['Category', 'SubCategory']).sum()
print("Sum of values:")
print(sum_values)

MultiIndex DataFrame:
                      Value1  Value2  Value3
Category SubCategory                        
A        one              39      35      95
         two              10      20      19
B        one              80      51      60
         two              69      36      86
C        one               1      52      53
         two              50       8      51
Sum of values:
                      Value1  Value2  Value3
Category SubCategory                        
A        one              39      35      95
         two              10      20      19
B        one              80      51      60
         two              69      36      86
C        one               1      52      53
         two              50       8      51
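
Each ('Category', 'SubCategory') pair occurs exactly once here, so grouping by both levels returns the original values unchanged. Grouping by a single level shows an actual aggregation. A sketch using the values printed above; `df_mi` and `per_category` are illustrative names:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B", "C", "C"], ["one", "two", "one", "two", "one", "two"]],
    names=("Category", "SubCategory"),
)
# Same values as the output above.
df_mi = pd.DataFrame(
    [[39, 35, 95], [10, 20, 19], [80, 51, 60], [69, 36, 86], [1, 52, 53], [50, 8, 51]],
    index=index,
    columns=["Value1", "Value2", "Value3"],
)

# level= groups on an index level instead of a column.
per_category = df_mi.groupby(level="Category").sum()
print(per_category)
```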


In [91]:
#Solution 15
date_rng = pd.date_range(start='2024-01-01', end='2024-01-10', freq='D')
df = pd.DataFrame({'Date':np.random.choice(date_rng, size=20),'Category':np.random.choice(['A','B','C'],size=20),'Values':np.random.randint(1,101, size=20)})
print("Original DataFrame:\n", df)


Original DataFrame:
          Date Category  Values
0  2024-01-05        C       3
1  2024-01-02        C      51
2  2024-01-09        A      53
3  2024-01-05        A      41
4  2024-01-07        B      18
5  2024-01-05        B      34
6  2024-01-08        A      78
7  2024-01-07        A      35
8  2024-01-02        A      51
9  2024-01-06        C      75
10 2024-01-01        A      19
11 2024-01-03        A      79
12 2024-01-09        A       7
13 2024-01-07        A      89
14 2024-01-07        B      46
15 2024-01-04        B      24
16 2024-01-03        B      71
17 2024-01-08        C      78
18 2024-01-07        C      78
19 2024-01-07        C      74


In [94]:
# Create a pivot table to compute the sum of 'Values' for each 'Category' by 'Date'
pivot_table= df.pivot_table(index='Date', values='Values', columns='Category', aggfunc='sum')
print("Pivot table:\n", pivot_table)

Pivot table:
 Category        A     B      C
Date                          
2024-01-01   19.0   NaN    NaN
2024-01-02   51.0   NaN   51.0
2024-01-03   79.0  71.0    NaN
2024-01-04    NaN  24.0    NaN
2024-01-05   41.0  34.0    3.0
2024-01-06    NaN   NaN   75.0
2024-01-07  124.0  64.0  152.0
2024-01-08   78.0   NaN   78.0
2024-01-09   60.0   NaN    NaN
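
Date and category combinations with no rows appear as NaN in the pivot table. If zero is the more useful value for a sum, `fill_value` can supply it. A sketch on a few of the rows shown above; `df_p` and `pivot` are illustrative names:

```python
import pandas as pd

# A handful of rows taken from the output above.
df_p = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-07", "2024-01-07"]
    ),
    "Category": ["A", "A", "C", "A", "A"],
    "Values": [19, 51, 51, 35, 89],
})

# fill_value replaces missing combinations in the result.
pivot = df_p.pivot_table(
    index="Date", columns="Category", values="Values", aggfunc="sum", fill_value=0
)
print(pivot)
```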


In [100]:
#Solution 16
df = pd.DataFrame({'Year':np.random.choice([2021,2023,2022], size=20), 'Quarter':np.random.choice(['Q1','Q2','Q3','Q4'], size=20), 'Revenue':np.random.randint(100000,500000, size=20)})

print('Original DataFrame:\n', df)

pivot_table = df.pivot_table(columns='Quarter',index='Year', values='Revenue', aggfunc='mean')
print("Pivot Table:")
print(pivot_table)

Original DataFrame:
     Year Quarter  Revenue
0   2022      Q2   109883
1   2022      Q4   470460
2   2022      Q3   346520
3   2023      Q3   473143
4   2021      Q3   115136
5   2023      Q1   162579
6   2022      Q2   246729
7   2023      Q2   419565
8   2021      Q4   237055
9   2021      Q2   118158
10  2022      Q4   357930
11  2021      Q4   458314
12  2021      Q3   450389
13  2022      Q3   393789
14  2023      Q3   309351
15  2021      Q1   275472
16  2021      Q1   115722
17  2023      Q2   397306
18  2021      Q3   217144
19  2023      Q1   198941
Pivot Table:
Quarter        Q1        Q2             Q3        Q4
Year                                                
2021     195597.0  118158.0  260889.666667  347684.5
2022          NaN  178306.0  370154.500000  414195.0
2023     180760.0  408435.5  391247.000000       NaN


In [108]:
#Solution 17
df = pd.DataFrame(np.random.randint(1,10, size=(5,3)), columns=['A','B','C'])
print("Original DataFrame:\n", df)

df_doubled = df.apply(lambda x:x*2)
print("Doubled DataFrame:\n", df_doubled)

Original DataFrame:
    A  B  C
0  9  3  7
1  9  4  2
2  7  8  2
3  4  4  1
4  3  5  3
Doubled DataFrame:
     A   B   C
0  18   6  14
1  18   8   4
2  14  16   4
3   8   8   2
4   6  10   6
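
For simple arithmetic, the same result is available without `apply`, since DataFrame operators work element-wise. A sketch on the values printed above; the names are illustrative:

```python
import pandas as pd

# Same values as the output above.
df_small = pd.DataFrame({
    "A": [9, 9, 7, 4, 3],
    "B": [3, 4, 8, 4, 5],
    "C": [7, 2, 2, 1, 3],
})

via_apply = df_small.apply(lambda col: col * 2)  # function called once per column
via_operator = df_small * 2                      # element-wise operator

print(via_apply.equals(via_operator))
```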


In [114]:
#Solution 18
df = pd.DataFrame(np.random.randint(1,10, size=(6,3)), columns=['A','B','C'])
print("Original DataFrame:\n", df)

Original DataFrame:
    A  B  C
0  9  9  1
1  1  6  1
2  9  4  7
3  5  5  2
4  9  3  5
5  3  1  9


In [115]:
df['Sum'] = df.apply(lambda row: row.sum(), axis=1)

In [116]:
df

   A  B  C  Sum
0  9  9  1   19
1  1  6  1    8
2  9  4  7   20
3  5  5  2   12
4  9  3  5   17
5  3  1  9   13
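
The lambda satisfies the assignment, but the built-in row sum gives the same values. Selecting the source columns explicitly also guards against including an existing 'Sum' column if the cell is executed twice. A sketch on the values printed above; the names are illustrative:

```python
import pandas as pd

# Same values as the output above.
df_small = pd.DataFrame({
    "A": [9, 1, 9, 5, 9, 3],
    "B": [9, 6, 4, 5, 3, 1],
    "C": [1, 1, 7, 2, 5, 9],
})

via_apply = df_small.apply(lambda row: row.sum(), axis=1)
via_sum = df_small[["A", "B", "C"]].sum(axis=1)  # explicit column selection

print(list(via_sum))
```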


In [122]:
#Solution 19
text_data = pd.Series(['Apple','Banana','Cherry','DragonFruit','Elderberry'])
text_data

0          Apple
1         Banana
2         Cherry
3    DragonFruit
4     Elderberry
dtype: object

In [123]:
uppercase_data = text_data.str.upper()

In [124]:
print(uppercase_data)

0          APPLE
1         BANANA
2         CHERRY
3    DRAGONFRUIT
4     ELDERBERRY
dtype: object


In [125]:
#Solution 20
# Extract the first three characters of each string
first_three_chars = text_data.str[:3]
print("First three characters:")
print(first_three_chars)

First three characters:
0    App
1    Ban
2    Che
3    Dra
4    Eld
dtype: object
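
The `.str` accessor passes missing values through instead of raising, and strings shorter than the slice are returned whole. A short sketch with a hypothetical Series containing one missing entry; `text`, `prefix` and `same` are illustrative names:

```python
import pandas as pd

# Hypothetical data: one missing value and one string shorter than 3 characters.
text = pd.Series(["Apple", "Banana", None, "Hi"])

prefix = text.str[:3]          # slice syntax
same = text.str.slice(0, 3)    # equivalent method form

print(prefix)
```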
