# Indexing, selecting, assigning reference
* Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation you'll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.


## Loading Data to work on
* We will work with the world sugar data set:
* [Sugar Data set](https://www.kaggle.com/datasets/muhammadtalhaawan/world-sugar-dataset-2018-2024)

In [3]:
# import pandas 
import pandas as pd
# load sugar data set
consumptionOfSugar = pd.read_csv('sugar_data_set/consumption_df.csv')  # forward slash avoids the invalid '\c' escape
consumptionOfSugar.head()

Unnamed: 0,Name,2018/19,2019/20,2020/21,2021/22,2022/23,May2023/24,Action
0,India,27500,27000,28000,29000,29500,31000,consumption
1,European_Union,17000,17000,16700,17000,17000,17000,consumption
2,China,15800,15400,15500,14800,15500,15600,consumption
3,United_States,10982,11109,11032,11314,11498,11499,consumption
4,Brazil,10600,10650,10150,9500,9500,9542,consumption


## 1. Indexing and Selecting Data
* We can select a column existing in a DataFrame by using the column name as a key:
```python
df['column_name']
```
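* We can also select a column existing in a DataFrame by using the column name as an attribute (this works only when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method):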
```python
df.column_name
```
* We can select a row by its index label using loc:
```python
df.loc[row_index]
```
* We can select a row by its integer position using iloc:
```python
df.iloc[row_index]
```
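
* A minimal sketch of the four access patterns above. The DataFrame, its column names and its index labels are hypothetical and only serve as an illustration; they are not part of the sugar data set:

```python
import pandas as pd

# Hypothetical toy data with string index labels, so that
# label-based and position-based access are easy to tell apart
df = pd.DataFrame(
    {"country": ["India", "China", "Brazil"], "tons": [10, 20, 30]},
    index=["x", "y", "z"],
)

col_by_key = df["tons"]     # column selected by key (returns a Series)
col_by_attr = df.tons       # same column, selected as an attribute
row_by_label = df.loc["y"]  # row whose index label is "y"
row_by_pos = df.iloc[1]     # row at integer position 1 (the second row)
```

* Here `df.loc["y"]` and `df.iloc[1]` return the same row, because the label "y" happens to sit at position 1.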

## 2. Selecting a certain entry
* We can select a single entry with loc by passing the row label and the column name:
```python
df.loc[row_index, 'column_name']
```

* Or we can use iloc to select a single entry by passing the integer positions of the row and the column:
```python
df.iloc[row_index, column_index]
```
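
* The difference between the two becomes visible once the row order no longer matches the index labels. A small sketch on hypothetical data (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0, 1, 2
df = pd.DataFrame(
    {"country": ["India", "China", "Brazil"], "tons": [27500, 15800, 10600]}
)

# Sorting reorders the rows but keeps their original index labels
by_tons = df.sort_values("tons")

by_label = by_tons.loc[0, "tons"]    # row labelled 0 (India), column "tons"
by_position = by_tons.iloc[0, 1]     # first row after sorting (Brazil), second column
```

* After sorting, label 0 still refers to India, while position 0 refers to the smallest value, so the two lookups return different entries.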





In [11]:
# Access the 2018/19 column by name
# consumptionOfSugar['2018/19']
# consumptionOfSugar.loc[:, '2018/19'] -> loc accesses the column by name
consumptionOfSugar.iloc[:, 1]  # iloc accesses the column by integer position
# consumptionOfSugar.2018/19 -> this will not work: the name starts with a digit and contains a slash
consumptionOfSugar.Action # this will work; as the last expression, it is what the cell displays

0     consumption
1     consumption
2     consumption
3     consumption
4     consumption
5     consumption
6     consumption
7     consumption
8     consumption
9     consumption
10    consumption
11    consumption
12    consumption
13    consumption
14    consumption
15    consumption
16    consumption
17    consumption
18    consumption
19    consumption
20    consumption
21    consumption
22    consumption
23    consumption
24    consumption
25    consumption
26    consumption
Name: Action, dtype: object

In [16]:
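# selecting a certain entry (only the last expression is displayed)
consumptionOfSugar.loc[1, '2018/19'] #? loc: row label 1, column '2018/19'
consumptionOfSugar.iloc[1,3] #? iloc: row position 1, column position 3
consumptionOfSugar.Action[0] #? dot notation for the column, then row label 0
consumptionOfSugar['2018/19'][0] #? bracket notation for the column, then row label 0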

'27,500'

## .loc vs .iloc
* .loc gets rows (or columns) with particular labels from the index.
* .iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
* Both loc and iloc are row-first, column-second. This is the opposite of chained bracket indexing such as df['column_name'][0], which is column-first, row-second.

* This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with iloc, we can do the following:

> `consumptionOfSugar.iloc[:, 0]`

### Choosing between loc and iloc
* When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.
 
* iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.
 
* Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).
 
* This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].
 
* Otherwise, the semantics of using loc are the same as those for iloc.
> In a nutshell, loc slices include their end label, while iloc slices exclude their end position.
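
* The slicing difference can be checked on a small example. This is a sketch on hypothetical Series (the values and the fruit labels are made up for illustration):

```python
import pandas as pd

# Default integer index 0..5, values 10..15
s = pd.Series(range(10, 16))

by_pos = s.iloc[0:3]    # positions 0, 1, 2 (end excluded)
by_label = s.loc[0:3]   # labels 0, 1, 2, 3 (end included)

# With string labels, an inclusive end is the convenient choice
fruit = pd.Series(
    [1, 2, 3, 4], index=["Apples", "Bananas", "Potatoes", "Tomatoes"]
)
label_slice = fruit.loc["Apples":"Potatoes"]  # includes "Potatoes"
```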

In [18]:
consumptionOfSugar.iloc[:, 0] 


0              India
1     European_Union
2              China
3      United_States
4             Brazil
5          Indonesia
6             Russia
7           Pakistan
8             Mexico
9              Egypt
10            Turkey
11          Thailand
12        Bangladesh
13              Iran
14           Vietnam
15       Philippines
16          Malaysia
17          Ethiopia
18             Japan
19           Algeria
20          Colombia
21             Sudan
22      South_Africa
23       Korea_South
24           Nigeria
25             Other
26             Total
Name: Name, dtype: object

* We can also select only part of the data by specifying the rows we want, followed by the position of the column we want.

In [22]:
consumptionOfSugar.iloc[:3, 0] #? This will retrieve the first three entries of the first column
# it is also possible to put the positions we want in a list and pass it to iloc
consumptionOfSugar.iloc[[4,11,2] , 3]
# to get the last 5 elements we use negative indexing
consumptionOfSugar.iloc[-5:, 0] #? This will retrieve the last five entries of the first column

22    South_Africa
23     Korea_South
24         Nigeria
25           Other
26           Total
Name: Name, dtype: object

In [25]:
# we can use column names to access the data
consumptionOfSugar.loc[:5, '2018/19'] #? loc slices are inclusive, so this retrieves the six rows labelled 0 to 5 of the column 2018/19

# we can get many columns at once
consumptionOfSugar.loc[:5, ['2018/19', '2019/20', '2020/21']] #? the same six rows for the columns 2018/19, 2019/20 and 2020/21

Unnamed: 0,2018/19,2019/20,2020/21
0,27500,27000,28000
1,17000,17000,16700
2,15800,15400,15500
3,10982,11109,11032
4,10600,10650,10150
5,7055,7356,7445


## Manipulating The index
* Sometimes, we need to change the name of the index, or even change the index to be one of the columns in the data set.
* We can do that by using the following code:
```python
df.set_index('column_name')
```
* We can also reset the index to be the default index by using the following code:
```python
df.reset_index()
```
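
* A short sketch of both calls on hypothetical data (column names and values are made up for illustration). Note that, by default, both methods return a new DataFrame and leave the original unchanged:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"country": ["India", "China"], "tons": [27500, 15800]})

# Use the "country" column as the index; df itself is not modified
indexed = df.set_index("country")
india_tons = indexed.loc["India", "tons"]  # rows can now be looked up by country

# Move the index back into a regular column and restore the default 0..n-1 index
restored = indexed.reset_index()
```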


## Conditional Selection
* We can select rows that satisfy a condition. For example, we can select all the rows whose value in a column is greater than 100 by using the following code:
```python
df[df['column_name'] > 100]
```
* We can also select all the data that is greater than 100 and less than 200 by using the following code:
```python
df[(df['column_name'] > 100) & (df['column_name'] < 200)]
```
* The condition returns a boolean Series, which we can use as a mask: only the rows where it is True are kept.
* Usually we do this to be able to apply some statistical functions to the data we want. For example, we can select all the rows with values greater than 100 and less than 200, then take the mean of that column over the selected rows by using the following code:
```python
df[(df['column_name'] > 100) & (df['column_name'] < 200)]['column_name'].mean()
```
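
* The steps above can be sketched on a small hypothetical DataFrame (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"item": ["a", "b", "c", "d"], "score": [50, 120, 180, 250]})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > and <.
mask = (df["score"] > 100) & (df["score"] < 200)

selected = df[mask]                        # rows where the mask is True
mean_score = df.loc[mask, "score"].mean()  # mean of the selected values only
```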


In [38]:
# the values are stored as strings, so we compare against the string '26,000'
consumptionOfSugar[consumptionOfSugar['2018/19'] != '26,000']

Unnamed: 0,Name,2018/19,2019/20,2020/21,2021/22,2022/23,May2023/24,Action
0,India,27500,27000,28000,29000,29500,31000,consumption
1,European_Union,17000,17000,16700,17000,17000,17000,consumption
2,China,15800,15400,15500,14800,15500,15600,consumption
3,United_States,10982,11109,11032,11314,11498,11499,consumption
4,Brazil,10600,10650,10150,9500,9500,9542,consumption
5,Indonesia,7055,7356,7445,7600,7800,7900,consumption
6,Russia,6110,6820,5804,6350,6500,6450,consumption
7,Pakistan,5400,5540,5750,6000,6150,6300,consumption
8,Mexico,4317,4349,4171,4342,4330,4414,consumption
9,Egypt,3100,3250,3340,3430,3320,3400,consumption


### Built-in conditional selectors
* pandas comes with several built-in conditional selectors; two common ones are isin and isnull (with its companion notnull).
* isin is used to select the data whose value is in a given list. For example, we can select all the rows whose value is in the list [1, 2, 3] by using the following code:
```python
df[df['column_name'].isin([1, 2, 3])]
```
* isnull is used to select the data that is missing (NaN). For example, we can select all the rows where a column is null by using the following code:
```python
df[df['column_name'].isnull()]
```
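
* A short sketch of both selectors, together with notnull, on hypothetical data (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data with two missing values
df = pd.DataFrame(
    {
        "country": ["India", "China", "Brazil", "Egypt"],
        "tons": [27500.0, None, 10600.0, None],
    }
)

in_list = df[df["country"].isin(["India", "Brazil"])]  # rows whose country is in the list
missing = df[df["tons"].isnull()]                      # rows where tons is NaN
present = df[df["tons"].notnull()]                     # rows where tons has a value
```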


### Assign data
* We can assign data to a certain column by using the following code:
```python
df['column_name'] = data
```
* We can also assign data to a certain row by using the following code:
```python
df.loc[row_index] = data
```
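
* A minimal sketch of these assignments on hypothetical data (the column names and values are made up for illustration). The last line additionally shows that loc can combine a condition with a column name to update only the matching entries:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"country": ["India", "China"], "tons": [27500, 15800]})

df["source"] = "example"   # a scalar is broadcast to every row
df["position"] = [1, 2]    # an iterable must match the number of rows

# Overwrite the whole row labelled 1 (one value per column, in column order)
df.loc[1] = ["China", 15900, "example", 2]

# Update a single column only where a condition holds
df.loc[df["country"] == "India", "tons"] = 28000
```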
