# Big Data Real-Time Analytics with Python and Spark

## Chapter 3 - Data Manipulation in Python with Pandas
- Documentation: https://pandas.pydata.org/
- Convert a matrix to a dataframe
- Extract a Series from a dataframe
- Create Series (We can specify the data type and index)
- See the index and values of the Series
- See the data type of the Series (a Series can hold values of different types)
- Operations with Series
- How to transform a NumPy array, a list and a dictionary into a Series
- Some methods: .unique(), .nunique(), .nlargest(), .nsmallest(), .isna(), .eq(), .gt(), .mean()
- How to create NaN values to replace invalid characters
- count() ignores NaN values and value_counts() skips them by default. Use "isna().sum()" to count them
- .agg(['mean', 'median', 'sum', 'count'])

In [1]:
# Python version
from platform import python_version
print('The version used in this notebook is: ', python_version())

The version used in this notebook is:  3.8.8


In [4]:
# Import the pandas and NumPy modules
import pandas as pd
import numpy as np

In [5]:
# package version used in this notebook
%reload_ext watermark
%watermark -a "Bianca Amorim" --iversions

Author: Bianca Amorim

numpy : 1.23.3
pandas: 1.5.0



## Series Operations

In [6]:
# List of lists (matrix)
matrix = [[9, 4, 3], [2, 6, 1], [7, 5, 8]]

In [12]:
matrix

[[9, 4, 3], [2, 6, 1], [7, 5, 8]]

In [8]:
# Convert a matrix to a dataframe
df = pd.DataFrame(matrix)

In [9]:
# Data type
type(df)

pandas.core.frame.DataFrame

In [11]:
# Extract the row at index 0 (a single row or column extracted from a DataFrame is a Series)
line = df.iloc[0, :]

In [13]:
type(line)

pandas.core.series.Series

In [15]:
# Extract the column labeled 0
column = df[0]

In [16]:
type(column)

pandas.core.series.Series
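
The row and column extraction above can also be done by label. The following is a minimal sketch, assuming hypothetical row labels (`r0`, `r1`, `r2`) and column labels (`a`, `b`, `c`) that are not part of the original notebook. It shows that selecting a single row or column returns a Series, while selecting a list of columns returns a DataFrame.

```python
import pandas as pd

# Same matrix as above, with hypothetical row and column labels
df_labeled = pd.DataFrame(
    [[9, 4, 3], [2, 6, 1], [7, 5, 8]],
    index=["r0", "r1", "r2"],
    columns=["a", "b", "c"],
)

first_row = df_labeled.iloc[0]    # row by position -> Series
same_row = df_labeled.loc["r0"]   # row by label -> Series
first_col = df_labeled["a"]       # single column -> Series
sub_frame = df_labeled[["a"]]     # list of columns -> DataFrame

print(type(first_row).__name__, type(sub_frame).__name__)
```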

In [17]:
# Create a Series
serie_pandas = pd.Series(['a', 'b', 'c', 'd', 'e'])

In [18]:
serie_pandas

0    a
1    b
2    c
3    d
4    e
dtype: object

In [19]:
# View the index
serie_pandas.index

RangeIndex(start=0, stop=5, step=1)

In [20]:
# View the values
serie_pandas.values

array(['a', 'b', 'c', 'd', 'e'], dtype=object)

In [21]:
# View the data type (O = object, the dtype pandas uses here for strings/text)
serie_pandas.dtype

dtype('O')

In [22]:
# View the shape (a Series does not have rows and columns; it is a one-dimensional structure with 5 elements)
serie_pandas.shape

(5,)

In [23]:
# pandas assigns an index automatically, but we can set a custom one if we want
serie_pandas = pd.Series(['a', 'b', 'c', 10, True], index = [10, 20, 30, 40, 50])

In [24]:
serie_pandas.index

Int64Index([10, 20, 30, 40, 50], dtype='int64')

In [25]:
serie_pandas.dtype

dtype('O')

In [26]:
serie_pandas.values

array(['a', 'b', 'c', 10, True], dtype=object)

In [27]:
serie_pandas

10       a
20       b
30       c
40      10
50    True
dtype: object

In [29]:
serie_pandas = pd.Series(['a', 'b', 'c', 10, True])

In [30]:
serie_pandas

0       a
1       b
2       c
3      10
4    True
dtype: object

In [31]:
# A Series can hold values of several types
# In that case the dtype is reported as object, the generic dtype that can hold any Python value
serie_pandas.dtype

dtype('O')
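
To see what an object Series actually contains, one option is to inspect the type of each element. This is a small sketch rather than part of the original notebook; the names `mixed` and `type_names` are illustrative.

```python
import pandas as pd

mixed = pd.Series(['a', 'b', 'c', 10, True])

# The Series-level dtype is object because the elements are not all the same type
print(mixed.dtype)

# Map each element to the name of its Python type
type_names = mixed.map(lambda value: type(value).__name__)
print(type_names.tolist())  # ['str', 'str', 'str', 'int', 'bool']
```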

In [32]:
print(serie_pandas[0])

a


In [33]:
type(serie_pandas[0])

str

In [34]:
print(serie_pandas[3])

10


In [35]:
type(serie_pandas[3])

int

In [36]:
print(serie_pandas[4])

True


In [37]:
type(serie_pandas[4])

bool

In [39]:
# This works because the element at index 3 is an int
# The same operation on index 2 would fail because that element is a string
result = serie_pandas[3] + 1

In [40]:
print(result)

11


In [42]:
# Use a name other than "list" to avoid shadowing the built-in type
words = ['Data', 'Science', 'Academy']

In [44]:
# Convert the list to a Series
serie_pandas = pd.Series(words)

In [45]:
print(serie_pandas)

0       Data
1    Science
2    Academy
dtype: object


In [46]:
# With a dictionary, the keys become the index
dictionary = {'a': 'Data', 'b': 'Science', 'c': 'Academy'}

In [47]:
serie_pandas = pd.Series(dictionary)

In [48]:
print(dictionary)

{'a': 'Data', 'b': 'Science', 'c': 'Academy'}


In [49]:
print(serie_pandas)

a       Data
b    Science
c    Academy
dtype: object


In [54]:
# Create a NumPy array
arr = np.random.randint(0, 10, size = 5)

In [55]:
print(arr)

[1 8 3 4 7]


In [56]:
serie_pandas = pd.Series(arr)

In [57]:
print(serie_pandas)

0    1
1    8
2    3
3    4
4    7
dtype: int64


In [58]:
serie_pandas = pd.Series({1: 'Data', 2: 'science'})

In [59]:
serie_pandas

1       Data
2    science
dtype: object

In [60]:
# Note: the index labels here start at 1, so 'Data' has label 1, not 0.
# Asking for label 0 would raise a KeyError
print(serie_pandas[1])

Data
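
The difference between labels and positions can be made explicit with `.loc` (label-based) and `.iloc` (position-based). A minimal sketch using the same two-element Series; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series({1: 'Data', 2: 'science'})

print(s.loc[1])   # label 1 -> Data
print(s.iloc[0])  # position 0 -> Data

# With an integer index, s[0] is looked up as a label, and label 0 does not exist
label_missing = False
try:
    s[0]
except KeyError:
    label_missing = True
print(label_missing)  # True
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the index itself is made of integers.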


In [61]:
serie_pandas = pd.Series([1, 2, 3, 4, 5])

In [62]:
print(serie_pandas)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [63]:
# If we do not want the int64 type, we can set the dtype when creating the Series
serie_pandas = pd.Series([1, 2, 3, 4, 5], dtype = 'float')

In [64]:
print(serie_pandas)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
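
The dtype can also be changed after the Series exists. A short sketch, assuming we want to convert an existing integer Series with `astype` or request a smaller integer type at creation; the names `ints`, `floats` and `small` are illustrative.

```python
import pandas as pd

ints = pd.Series([1, 2, 3, 4, 5])

# astype returns a new Series with the requested dtype
floats = ints.astype('float64')

# A narrower integer type can be requested at creation time
small = pd.Series([1, 2, 3], dtype='int8')

print(floats.dtype, small.dtype)  # float64 int8
```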


In [65]:
serie_pandas = pd.Series(['Blue', 'Yellow', 'White', 'White', 'Green', 'Blue'])

In [66]:
# Unique values
print(serie_pandas.unique())

['Blue' 'Yellow' 'White' 'Green']


In [67]:
# Number of unique values
print(serie_pandas.nunique())

4


In [68]:
serie_pandas = pd.Series([201, 323, 17, 97, 43, 9, 26, 4])

In [69]:
# To know the n largest values (Here we see the 3 largest values)
print(serie_pandas.nlargest(n = 3))

1    323
0    201
3     97
dtype: int64


In [70]:
# The n smallest values
print(serie_pandas.nsmallest(n = 2))

7    4
5    9
dtype: int64
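
For comparison, the same results can be obtained by sorting and taking the first elements. This sketch only illustrates the equivalence for this data, which has no repeated values.

```python
import pandas as pd

s = pd.Series([201, 323, 17, 97, 43, 9, 26, 4])

# Sort in descending order and keep the first 3 elements
top3 = s.sort_values(ascending=False).head(3)

# Sort in ascending order and keep the first 2 elements
bottom2 = s.sort_values().head(2)

print(top3.tolist())     # [323, 201, 97]
print(bottom2.tolist())  # [4, 9]
```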


In [71]:
serie_pandas = pd.Series(['Blue', 'Yellow', 'White', 'White', 'Green', 'Blue'])

In [72]:
print(serie_pandas.value_counts())

Blue      2
White     2
Yellow    1
Green     1
dtype: int64


In [73]:
# We can replace invalid values, such as "?", with np.nan
# Missing values are easier to handle as NaN than as placeholder characters
# So NaN is sometimes inserted on purpose to simplify data cleaning
serie_pandas = pd.Series([1, 2, 3, np.nan, np.nan])

In [74]:
print(serie_pandas.isna())

0    False
1    False
2    False
3     True
4     True
dtype: bool


In [75]:
print(serie_pandas.isna().sum())

2
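
The replacement of invalid characters described above might look like the following. This is a sketch under the assumption that the raw data arrives as strings with "?" marking missing entries; the Series `raw` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical raw column where "?" marks an invalid entry
raw = pd.Series(['1', '2', '?', '4', '?'])

# Replace the placeholder with NaN, then convert the remaining strings to numbers
cleaned = raw.replace('?', np.nan)
numeric = pd.to_numeric(cleaned)

# Alternative: let to_numeric turn anything unparseable into NaN
coerced = pd.to_numeric(raw, errors='coerce')

print(numeric.isna().sum())  # 2
```

The `errors='coerce'` variant is shorter but converts every unparseable value to NaN, not only "?".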


In [78]:
# count() returns the number of elements, ignoring NaN values
print(serie_pandas.count())

3


**Never use count() to count NaN values: it only counts the non-NaN elements. Use isna().sum() instead.**
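
A minimal sketch of the counting options side by side, using a Series with two missing values. `value_counts()` also skips NaN by default, but it includes them when called with `dropna=False`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan])

print(s.size)          # 5, all elements including NaN
print(s.count())       # 3, non-NaN elements only
print(s.isna().sum())  # 2, NaN elements only

# NaN appears as its own entry when dropna=False
counts_with_nan = s.value_counts(dropna=False)
print(counts_with_nan)
```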


In [77]:
serie_pandas = pd.Series([1, 2, 3, 4])

In [79]:
# Check element by element whether each value is equal to 3
# eq means "equal to"
print(serie_pandas.eq(3))

0    False
1    False
2     True
3    False
dtype: bool


In [80]:
# gt means "greater than"
print(serie_pandas.gt(2))

0    False
1    False
2     True
3     True
dtype: bool
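
The boolean Series returned by `.eq()` and `.gt()` can be used as a mask to filter values or to answer yes/no questions. A short sketch with the same data; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Keep only the elements where the mask is True
mask = s.gt(2)
greater_than_two = s[mask]
print(greater_than_two.tolist())  # [3, 4]

# any() answers "is there at least one 3 in the Series?"
has_three = bool(s.eq(3).any())
print(has_three)  # True

# The methods are equivalent to the comparison operators
same_result = (s > 2).equals(s.gt(2))
print(same_result)  # True
```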


In [81]:
serie_pandas = pd.Series([56, 34, 68, 21, 49])

In [83]:
# Mean 
print(serie_pandas.mean())

45.6


In [84]:
print(serie_pandas.agg(['mean', 'median', 'sum', 'count']))

mean       45.6
median     49.0
sum       228.0
count       5.0
dtype: float64
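
Note that `.agg()` returns a scalar when given a single function name and a Series when given a list. A minimal sketch with the same data; `total` and `extremes` are illustrative names.

```python
import pandas as pd

s = pd.Series([56, 34, 68, 21, 49])

# A single function name returns a scalar
total = s.agg('sum')
print(total)  # 228

# A list of function names returns a Series indexed by those names
extremes = s.agg(['min', 'max'])
print(extremes['min'], extremes['max'])  # 21 68
```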


# The End