## Pandas
Pandas is one of the most popular Python libraries for data analysis. It helps you work with datasets and can, for example, load the contents of a CSV file into a DataFrame.

<img src='pandas.png'>

### Importing pandas
- import `pandas` as `pd`

In [None]:
import pandas as pd

#### Series & DataFrame

##### 1. Series:
A Series is a single column of values, like one column in an Excel sheet.
##### 2. DataFrame:
A DataFrame is a table: data arranged in rows and columns.
###### If a DataFrame is a table, a Series is a list.

<img src='series-and-dataframe.width-1200.png'>

#### Create series
- A Series can be created by loading data from existing storage, such as a CSV or Excel file.
- It can also be created from a list, a dictionary, or a scalar value.
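The three in-memory options listed above can be sketched as follows. The variable names and the sample values are made up for illustration.

```python
import pandas as pd

# From a list: pandas assigns the default integer index 0, 1, 2, ...
from_list = pd.Series([3, 2, 0, 1])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'Day_1': 3, 'Day_2': 2, 'Day_3': 0})

# From a scalar: the value is repeated once for every index label given
from_scalar = pd.Series(5, index=['a', 'b', 'c'])

print(from_list)
print(from_dict)
print(from_scalar)
```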

In [None]:
pd.Series([3,2,0,1])

In [None]:
# set the index (row labels) and the name of the Series
pd.Series([3,2,0,1], index=['Day_1', 'Day_2','Day_3','Day_4'], name='Apples')

#### Create DataFrame
- An empty DataFrame can be created by calling the DataFrame constructor with no arguments.
- A DataFrame can also be created from a dictionary of arrays or lists, from a list of lists, or from a list of dictionaries.
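As a small sketch of two of these options, the block below builds an empty DataFrame and then the same table from a list of lists and from a NumPy array. The names `rows`, `from_lists` and `from_array` are illustrative only.

```python
import numpy as np
import pandas as pd

# Calling the constructor with no arguments gives an empty DataFrame
empty_frame = pd.DataFrame()
print(empty_frame.empty)

# Each inner list is one row; column names are passed separately
rows = [[3, 0], [2, 3], [0, 7]]
from_lists = pd.DataFrame(rows, columns=['apples', 'oranges'])

# A 2-D NumPy array works the same way
from_array = pd.DataFrame(np.array(rows), columns=['apples', 'oranges'])

print(from_lists)
print(from_array)
```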

In [None]:
# create dict
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}
# keys are the column names
# values are the lists of entries in each column
# then pass the dict to the pandas DataFrame constructor
mydata_frame = pd.DataFrame(data)
mydata_frame

In [None]:
# set index ( row labels)
import pandas as pd
mydata_frame= pd.DataFrame(data,index=['Day_1', 'Day_2','Day_3','Day_4'])
mydata_frame

###### Q: create a DataFrame by passing a list of dictionaries and the row indices.

### Reading data files
Most of the time, we won't be creating our own data by hand. Instead, we'll be working with data that already exists.
Data can be stored in many forms and formats, such as txt, CSV, and Excel files.

- We can read a CSV (Comma-Separated Values) file such as `data.csv` into a DataFrame with `pd.read_csv()`.
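To try `pd.read_csv()` without a file on disk, the sketch below reads CSV text from an in-memory buffer. The rows and author names are invented for illustration and are not taken from the articles dataset.

```python
import io
import pandas as pd

# A tiny CSV held in a string; the first line is the header row
csv_text = """title,author,reading_time
First post,Alice,5
Second post,Bob,11
"""

# read_csv accepts a file path or any file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

With a real file, the buffer is simply replaced by the file path.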


In [None]:
'''
INTRO: DATA SET
Medium is one of the most famous tools for spreading knowledge about almost any field. 
It is widely used to publish articles on ML, AI, and data science.
This dataset is a collection of about 338 articles in these fields.

'''
data_f=pd.read_csv('articles.csv')
data_f

We can examine the contents of a DataFrame using the `head()` method, which displays the first five rows by default.

In [None]:
data_f.head()

We can use `tail()` to see the last five rows.

In [None]:
data_f.tail()

In [None]:
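# use the first column as the index instead of the default 0, 1, 2, ... row numbers
data_f=pd.read_csv('articles.csv', index_col=0) # index_col accepts a column name or position
# an existing column can also be made the index with set_index()
#data_f.set_index('title')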
data_f.head()


###### Understanding Our Data

In [None]:
# using describe() "summary function"
data_f.reading_time.describe()

In [None]:
#info() function is used to get summary of the dataframe
data_f.info()

In [None]:
data_f.author.describe()

In [None]:
# To see a list of unique values we can use the unique() function
data_f.author.unique()
# note: although 'Justin Lee' appears 5 times in the column, the name is listed only once here
# each repeated value appears exactly once in the array returned by unique()

In [None]:
#To see a list of unique values and how often they occur in the dataset
data_f.author.value_counts()

In [None]:
data_f.shape

###### Q:Display a summary of the basic information about  data ('medium_data.csv')

### Accessing Data
Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation you'll run.

In [None]:
data_f=pd.read_csv('articles.csv')
data_f

###### Method 1: columns as attributes of the object
In Python, we can access a property of an object as an attribute. Columns in a pandas DataFrame work in much the same way.


In [None]:
data_f.reading_time

###### Method 2: the DataFrame as a dictionary
With a Python dictionary, we access values using the indexing (`[]`) operator. We can do the same with the columns of a DataFrame.

- Note:
the indexing operator `[]` has the advantage that it can handle column names containing reserved characters such as spaces (e.g. if we had a `Publishing house` column, `dataframe.Publishing house` wouldn't work).
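The note above can be checked on a small example. The `books` table and its values are hypothetical; only the `Publishing house` column name comes from the note.

```python
import pandas as pd

# A made-up table with one column name that contains a space
books = pd.DataFrame({
    'title': ['A', 'B'],
    'Publishing house': ['X Press', 'Y Books'],
})

# Both styles work when the column name is a valid Python identifier
by_attribute = books.title
by_brackets = books['title']

# Only the bracket style works when the name contains a space
publishers = books['Publishing house']
print(publishers)
```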

In [None]:
data_f['claps']

###### Access entry


In [None]:
data_f['claps'][3]

###### Accessor operators `loc`  and `iloc` 

- `iloc` treats the dataset like a big matrix (a list of lists) and selects data based on its numerical position. This is called 'index-based selection'.

- `loc` selects data by the labels in the index rather than by position. This is called 'label-based selection'.
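The difference is easiest to see when the index holds labels instead of the default row numbers. This sketch reuses the apples and oranges table from earlier; the variable name `fruit` is illustrative.

```python
import pandas as pd

fruit = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['Day_1', 'Day_2', 'Day_3', 'Day_4'],
)

# iloc selects by position: row number 0
first_by_position = fruit.iloc[0]

# loc selects by label: the row whose index label is 'Day_1'
first_by_label = fruit.loc['Day_1']

# Slices differ too: iloc excludes the end position, loc includes the end label
two_rows = fruit.iloc[0:2]
three_rows = fruit.loc['Day_1':'Day_3']

print(two_rows)
print(three_rows)
```

When the index is the default 0, 1, 2, ..., `iloc[0]` and `loc[0]` happen to return the same row, which is why the two calls on `data_f` look identical.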


In [None]:
data_f.iloc[0]

In [None]:
data_f.loc[0]

**Q:** Select just the 'subtitle' and 'date' columns from the DataFrame `df`.

###### Conditional selection
For example, what if we want to filter our articles to show only those written by Daniel Simmons?

In [None]:
data_f=pd.read_csv('articles.csv')
data_f.author=='Daniel Simmons' # returns a Series of True/False values, one per record, based on the author

In [None]:
#count them
sum(data_f.author=='Daniel Simmons')

In [None]:
data_f.loc[data_f.author=='Daniel Simmons']

In [None]:
data_f.title=='Chatbots were the next big thing: what happened? – The Startup – Medium'

In [None]:
sum(data_f.title=='Chatbots were the next big thing: what happened? – The Startup – Medium')

In [None]:
#show them
data_f.loc[data_f.title=='Chatbots were the next big thing: what happened? – The Startup – Medium']

In [None]:
(data_f['title']=='Chatbots were the next big thing: what happened? – The Startup – Medium').head()

In [None]:
data_f.loc[(data_f.claps=='8.3K')&(data_f.reading_time==11)]

In [None]:
data_f.loc[data_f.reading_time==11].shape

In [None]:
data_f.loc[data_f.claps=='8.3K'].shape

###### Built-in conditional selectors

In [None]:
# use isin() 
data_f.loc[data_f.reading_time.isin([5])]

###### Filtering data using query() method
`query()` takes a filter expression as a string and returns only the rows for which the expression is true.

In [None]:
data_f.query('reading_time > 5') # returns the articles with a reading time greater than 5

**Q:** In data ('medium_data.csv'):
- Count the articles where the number of claps is greater than 1000.
- Select only the articles published by Towards Data Science where the number of claps is greater than 1000.

###### Assigning data

In [None]:
import numpy as np
data_f['num_comments']=np.random.randint(1,50,len(data_f))
data_f

**Q:** Change the values of column `id` into numbers that have mean = 0.0 in data ('medium_data.csv').

###### Drop specified labels from rows or columns.

In [None]:
# we can remove rows or columns from data by drop()
data_f.drop(['num_comments'], axis=1,inplace=True)
data_f

##### Maps
It is possible to iterate over a DataFrame or Series as you would with a list, but doing so, especially on large datasets, is very slow.

A map is a function that takes one set of values and "maps" them to another set of values.
We can use maps to transform data from its current format into the format we want.


- The `apply` method can be used on both Series and DataFrames. On a DataFrame, it lets us transform the whole table by calling a custom function on each row.

###### NOTE:
`map` and `apply` don't modify the original data they're called on; they return new objects.
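A minimal sketch of both methods on the small apples and oranges table, showing that the original data stays unchanged until the result is assigned back. The names `fruit`, `doubled` and `totals` are illustrative.

```python
import pandas as pd

fruit = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]})

# Series.map: the function receives one value at a time
doubled = fruit.apples.map(lambda n: n * 2)

# DataFrame.apply with axis=1: the function receives one row at a time
totals = fruit.apply(lambda row: row.apples + row.oranges, axis=1)

# Neither call changed fruit itself
print(fruit)

# To keep a result, assign it to a column
fruit['total'] = totals
print(fruit)
```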




In [None]:
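def str_to_int(strk):
    # convert a claps string to a number in thousands: '2.8K' -> 2.8, '500' -> 0.5
    # note: despite the name, the result is a float
    return float(strk[:-1]) if 'K' in strk else float(strk)/1000.0

str_to_int('2.8K')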

In [None]:
data_f.loc[:,'claps'].apply(str_to_int)

In [None]:
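def str_to_int2(row):
    # same conversion, but takes a whole row and reads its claps value
    return float(row.claps[:-1]) if 'K' in row.claps else float(row.claps)/1000.0


data_f.apply(str_to_int2, axis=1) # axis=1 passes one row at a time to the function
# data_f.map(str_to_int2, axis=1) raises an error: row-wise functions need apply()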

#### Grouping
`groupby()` function is used to split the data into groups based on some criteria.
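As a sketch of the split-then-summarise idea, the block below groups a made-up table by author and aggregates one numeric column. The author names and reading times are invented for illustration.

```python
import pandas as pd

articles = pd.DataFrame({
    'author': ['Alice', 'Bob', 'Alice', 'Bob', 'Cara'],
    'reading_time': [5, 11, 7, 3, 9],
})

# Split the rows by author, then sum reading_time within each group
total_time = articles.groupby('author').reading_time.sum()
print(total_time)

# agg() computes several summaries per group in one call
summary = articles.groupby('author').reading_time.agg(['count', 'mean'])
print(summary)
```

Selecting the column before aggregating keeps text columns out of the calculation.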

In [None]:
data_f.groupby('author').reading_time.sum() # sum of reading_time for each author

In [None]:
data_f.index

###### Q: Get the least reading time in each title articles category in data ('medium_data.csv').

###### Combining:
We use `concat()` to combine DataFrames. It takes a list of DataFrames and stacks them into one.
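A small sketch with two made-up tables shows what happens to the index and to columns that exist in only one of the inputs. The names `first` and `second` and their values are illustrative.

```python
import pandas as pd

first = pd.DataFrame({'title': ['A', 'B'], 'reading_time': [5, 11]})
second = pd.DataFrame({'title': ['C'], 'reading_time': [7], 'claps': ['1.2K']})

# By default the rows are stacked and each table keeps its own index labels
stacked = pd.concat([first, second])
print(stacked)

# ignore_index=True renumbers the rows 0, 1, 2, ...
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered)
```

The `claps` column is missing from `first`, so its rows get NaN in the combined table.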

In [None]:
data_f2=pd.read_csv('medium_data.csv',nrows=100) # read only the first 100 rows; to match the number of rows in data_f, use nrows=data_f.shape[0]
data_f2

In [None]:
article_data=pd.concat([data_f, data_f2])
article_data

###### Dealing with Missing Data 
Entries with missing values are given the value NaN.

To select NaN entries we can use `isnull()`.

To remove all rows or columns that contain at least one missing value we can use `dropna()`.

To fill in missing values in a DataFrame, that is, to specify what the NaN values should be replaced with, we can use `fillna()`.
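The three methods can be compared side by side on a tiny table with a few gaps. The rows here are made up for illustration.

```python
import numpy as np
import pandas as pd

articles = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'publication': ['Towards Data Science', np.nan, 'The Startup'],
    'reading_time': [5, np.nan, 7],
})

# isnull() marks missing entries; sum() counts them per column
missing_per_column = articles.isnull().sum()
print(missing_per_column)

# dropna() drops rows with any NaN; axis=1 drops columns instead
complete_rows = articles.dropna()
complete_columns = articles.dropna(axis=1)

# fillna() accepts one value for everything or a dict of per-column values
filled = articles.fillna({'publication': 'Unknown', 'reading_time': 0})
print(filled)
```

All three calls return new objects and leave `articles` unchanged.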

In [None]:
article_data.isnull().sum()

In [None]:
#article_data.dropna() #remove all the rows that contain a missing value
article_data.dropna(axis=1) #remove all columns with at least one missing value

In [None]:
article_data.loc[:,['id','url','responses','publication','date']].fillna("Unknown")

**Q:** Replace the missing data in `subtitle` with the text 'no subtitle' in data ('medium_data.csv').