# Summary Functions And Maps
* After reading the data, we usually need to reformat it, because the data we deal with is rarely clean.
* We therefore apply various operations to it so that it suits the task we are working on.

In [2]:
# Import the data that we will work on.
import pandas as pd
# A forward slash avoids the invalid '\p' escape sequence and works on every platform.
ProducedSugar = pd.read_csv('sugar_data_set/production_df.csv')
ProducedSugar.head()


# ProducedSugar['2018/19'].map(lambda d: 1 if ',' in d else 9)

# Convert each yearly column to float: remove the thousands-separator commas, then cast.
# columns[1:-1] skips the first and last columns (Name and Action).
for item in ProducedSugar.columns[1:-1]:
    ProducedSugar[item] = ProducedSugar[item].str.replace(',', '', regex=False).astype(float)

ProducedSugar[:]


Unnamed: 0,Name,2018/19,2019/20,2020/21,2021/22,2022/23,May2023/24,Action
0,Brazil,29500.0,30300.0,42050.0,35450.0,38050.0,42010.0,production
1,India,34300.0,28900.0,33760.0,36880.0,32000.0,36000.0,production
2,European_Union,16750.0,17040.0,15216.0,16497.0,14899.0,15475.0,production
3,Thailand,14581.0,8294.0,7587.0,10157.0,11040.0,11200.0,production
4,China,10760.0,10400.0,10600.0,9600.0,9000.0,10000.0,production
5,United_States,8164.0,7392.0,8376.0,8307.0,8420.0,8369.0,production
6,Pakistan,5270.0,5340.0,6505.0,7560.0,6860.0,7110.0,production
7,Russia,6080.0,7800.0,5625.0,6000.0,7184.0,6336.0,production
8,Mexico,6812.0,5596.0,6058.0,6556.0,5708.0,6254.0,production
9,Australia,4725.0,4285.0,4335.0,4120.0,4200.0,4400.0,production
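The comma-stripping loop above can be tried on a small, made-up table. This is a minimal sketch: the CSV text, the names `raw`, `toy` and `toy_parsed`, and the values are assumptions for illustration, not the real data set. It also shows the `thousands` parameter of `read_csv` as a possible alternative that parses the separators while loading.

```python
import io
import pandas as pd

# Hypothetical CSV text: the numbers are quoted and use ',' as a thousands separator.
raw = (
    'Name,2018/19,2019/20,Action\n'
    'A,"1,500","2,250",production\n'
    'B,"980","1,020",production\n'
)

# Option 1: read the numbers as strings, then strip the commas and cast to float.
toy = pd.read_csv(io.StringIO(raw))
for col in toy.columns[1:-1]:
    toy[col] = toy[col].str.replace(',', '', regex=False).astype(float)

# Option 2: let read_csv interpret ',' as a thousands separator while loading.
toy_parsed = pd.read_csv(io.StringIO(raw), thousands=',')

print(toy.dtypes)
print(toy_parsed.dtypes)
```

Option 1 mirrors the loop used in this notebook. Option 2 may be more convenient when every numeric column uses the same separator.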


## Summary Functions
* Pandas provides many simple "summary functions" (not an official name) which condense or restructure the data in some useful way.
* Here are some of them:
  * describe(): gives a quick statistical overview of a column.
  * mean(): gives the average of the data.
  * unique(): gives the distinct values in the data.
  * value_counts(): gives the number of times each distinct value appears in the data.
  * apply(): applies a function to each element of a Series, or to each row or column of a DataFrame.
  * map(): applies a function to each element of a Series.
  * applymap(): applies a function to each element of a DataFrame (deprecated since pandas 2.1 in favor of DataFrame.map()).
  * groupby(): splits the data into groups so that each group can be summarized; covered in the next section.
  * pivot_table(): builds a spreadsheet-style pivot table; covered in the next section.
  * melt(): reshapes a wide table into a long one; covered in the next section.
  * stack(): moves column labels into the row index; covered in the next section.
  * unstack(): moves a row index level into the columns; covered in the next section.
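A few of the simpler functions from the list can be tried on a small frame. This is a minimal sketch; the frame `toy`, its column names and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one text column and one numerical column.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Output': [10.0, 20.0, 20.0, 30.0],
})

summary = toy['Output'].describe()     # count, mean, std, min, quartiles, max
average = toy['Output'].mean()         # arithmetic mean of the column
distinct = toy['Output'].unique()      # distinct values, in order of first appearance
name_summary = toy['Name'].describe()  # for text columns: count, unique, top, freq

print(summary)
print(average)
print(distinct)
print(name_summary)
```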

In [3]:

def desc_col(df):
    # Print the describe() summary of every column, one after another.
    for col in df.columns:
        print(df[col].describe())
        print('\n')

desc_col(ProducedSugar)
'''
describe() generates a high-level summary of the attributes of the given column.
It is type-aware, meaning that its output changes based on the data type of the input.
For string data (such as the Name column) it reports count, unique, top and freq.
For numerical data, here's what we get:
count -> number of non-missing elements
mean -> average of all elements
std -> standard deviation
min -> minimum element
25% -> 25th percentile
50% -> 50th percentile (the median)
75% -> 75th percentile
max -> maximum element
'''

count         27
unique        27
top       Brazil
freq           1
Name: Name, dtype: object


count        27.000000
mean      13270.962963
std       34239.698824
min         780.000000
25%        1661.500000
50%        2700.000000
75%        9462.000000
max      179158.000000
Name: 2018/19, dtype: float64


count        27.000000
mean      12337.703704
std       31813.793558
min         825.000000
25%        1694.000000
50%        2750.000000
75%        8047.000000
max      166559.000000
Name: 2019/20, dtype: float64


count        27.000000
mean      13341.777778
std       34745.474063
min         750.000000
25%        1682.500000
50%        2780.000000
75%        7981.500000
max      180114.000000
Name: 2020/21, dtype: float64


count        27.000000
mean      13376.518519
std       34708.587504
min         807.000000
25%        1650.000000
50%        2650.000000
75%        8953.500000
max      180583.000000
Name: 2021/22, dtype: float64


count        27.000000
mean      13131.7

In [4]:
ProducedSugar['2018/19'].value_counts()

29500.0     1
2400.0      1
15082.0     1
788.0       1
780.0       1
1300.0      1
1133.0      1
1262.0      1
1753.0      1
1520.0      1
1570.0      1
2100.0      1
2257.0      1
2966.0      1
34300.0     1
2200.0      1
2405.0      1
2700.0      1
4725.0      1
6812.0      1
6080.0      1
5270.0      1
8164.0      1
10760.0     1
14581.0     1
16750.0     1
179158.0    1
Name: 2018/19, dtype: int64
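Every count above is 1 because each production figure is distinct. value_counts() tends to be more informative on a column with repeated values. A minimal sketch on a made-up Series (the name `actions` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical categorical column with repeated values.
actions = pd.Series(
    ['production', 'production', 'consumption', 'production'],
    name='Action',
)

counts = actions.value_counts()                # occurrences, most frequent first
shares = actions.value_counts(normalize=True)  # same, as fractions of the total

print(counts)
print(shares)
```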

## Maps
* A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

* There are two mapping methods that you will use often.
    1. map()
    2. apply()

### map(): 
* The first, map(), is the simpler one. For example, suppose we had a DataFrame `reviews` of wine reviews (not the sugar data used here) and wanted to remean the scores the wines received to 0. We can do this as follows:
```python
review_points_mean = reviews.points.mean()
reviews.points.map(lambda point: point - review_points_mean)
```
* map() iterates over each element of the given Series and applies the lambda function to it.
* It returns a new Series and does not modify the original data.
* To keep the result, assign it back to a column.
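The remean example can be tried on a small, made-up Series. This is a minimal sketch; `points` and its values are assumptions standing in for a real column of scores.

```python
import pandas as pd

# Hypothetical scores.
points = pd.Series([80, 90, 100], name='points')

# Compute the mean once, then subtract it from every element.
points_mean = points.mean()
centered = points.map(lambda p: p - points_mean)

print(centered)  # the new, re-centered Series
print(points)    # the original Series is unchanged
```

Computing the mean outside the lambda avoids recalculating it for every element.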

In [5]:
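# Flag each row as above (1) or not above (0) the mean 2019/20 production.
# Note: this overwrites the original 2019/20 values with the 0/1 flags.
mean = ProducedSugar['2019/20'].mean()
ProducedSugar['2019/20'] = ProducedSugar['2019/20'].map(lambda value: 1 if value > mean else 0)
ProducedSugar['2019/20'].value_counts()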

0    22
1     5
Name: 2019/20, dtype: int64
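If the original figures are still needed later, the flags can go into a separate column instead of replacing the data. A minimal sketch on made-up values; the column name `above_mean` and the frame `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical production figures.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    '2019/20': [100.0, 300.0, 800.0],
})

threshold = toy['2019/20'].mean()

# Store the flags in a new column so the original values are kept.
toy['above_mean'] = toy['2019/20'].map(lambda v: 1 if v > threshold else 0)

# The same flags without a lambda: compare the whole column at once.
vectorized = (toy['2019/20'] > threshold).astype(int)

print(toy)
```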

### apply():
* apply() is similar to map(), but on a DataFrame it calls a function on whole rows or whole columns rather than on single elements.
* The same function is called once per row or once per column.
* We choose which one with the axis parameter:
  * to call it on each row -> axis='columns' or axis=1
  * to call it on each column -> axis='index' or axis=0
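The effect of the axis parameter is easiest to see on a tiny frame. A minimal sketch, assuming made-up values; `toy`, `row_totals` and `col_means` are hypothetical names.

```python
import pandas as pd

# Hypothetical frame: two rows (A, B) and two yearly columns.
toy = pd.DataFrame(
    {'2020/21': [10.0, 20.0], '2021/22': [30.0, 50.0]},
    index=['A', 'B'],
)

# axis='columns' (or 1): the function receives one row at a time.
row_totals = toy.apply(lambda row: row.sum(), axis='columns')

# axis='index' (or 0): the function receives one column at a time.
col_means = toy.apply(lambda col: col.mean(), axis='index')

print(row_totals)  # one value per row, indexed by A and B
print(col_means)   # one value per column, indexed by the column names
```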

In [6]:

ProducedSugar.head()

Unnamed: 0,Name,2018/19,2019/20,2020/21,2021/22,2022/23,May2023/24,Action
0,Brazil,29500.0,1,42050.0,35450.0,38050.0,42010.0,production
1,India,34300.0,1,33760.0,36880.0,32000.0,36000.0,production
2,European_Union,16750.0,1,15216.0,16497.0,14899.0,15475.0,production
3,Thailand,14581.0,0,7587.0,10157.0,11040.0,11200.0,production
4,China,10760.0,0,10600.0,9600.0,9000.0,10000.0,production


In [12]:
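def on_each_row(row):
    # With axis=1 the function is called once per row;
    # `row` is a Series indexed by the column names.
    # Print two fields of the row (these are the pairs of values in the output).
    print(row['2019/20'])
    print(row['2020/21'])
    return row

def on_each_column(col):
    # With axis=0 the function is called once per column;
    # `col` is a Series holding that column's values.
    print(col.name)
    col.name = col.name + 'updated'
    print(col.name)
    return col

ProducedSugar.apply(on_each_row, axis=1)
# ProducedSugar.apply(on_each_column, axis=0)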

1
42050.0
1
33760.0
1
15216.0
0
7587.0
0
10600.0
0
8376.0
0
6505.0
0
5625.0
0
6058.0
0
4335.0
0
3100.0
0
2780.0
0
2130.0
0
2565.0
0
2240.0
0
2106.0
0
2143.0
0
1830.0
0
1535.0
0
1240.0
0
1197.0
0
985.0
0
750.0
0
815.0
0
784.0
1
13802.0
1
180114.0


Unnamed: 0,Name,2018/19,2019/20,2020/21,2021/22,2022/23,May2023/24,Action
0,Brazil,29500.0,1,42050.0,35450.0,38050.0,42010.0,production
1,India,34300.0,1,33760.0,36880.0,32000.0,36000.0,production
2,European_Union,16750.0,1,15216.0,16497.0,14899.0,15475.0,production
3,Thailand,14581.0,0,7587.0,10157.0,11040.0,11200.0,production
4,China,10760.0,0,10600.0,9600.0,9000.0,10000.0,production
5,United_States,8164.0,0,8376.0,8307.0,8420.0,8369.0,production
6,Pakistan,5270.0,0,6505.0,7560.0,6860.0,7110.0,production
7,Russia,6080.0,0,5625.0,6000.0,7184.0,6336.0,production
8,Mexico,6812.0,0,6058.0,6556.0,5708.0,6254.0,production
9,Australia,4725.0,0,4335.0,4120.0,4200.0,4400.0,production


In [8]:
descriptor_counts = 0
