Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect, but it is sufficient for most real-world use. For data with float64 dtype, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

We call this a sentinel value: when present, it indicates a missing (or null) value:



In [3]:
import pandas as pd
import numpy as np

In [4]:
float_data=pd.Series([1.2,-3.5,np.nan,0])

In [5]:
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

In [6]:
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool

In [9]:
num=pd.Series(["anish",np.nan,None,"gone"])

In [10]:
num

0    anish
1      NaN
2     None
3     gone
dtype: object

In [11]:
num.isna()

0    False
1     True
2     True
3    False
dtype: bool

#Here the Python built-in None is also detected as a missing value by isna, just like NaN.
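As a small sketch of the claim that descriptive statistics exclude missing data by default, the example below reuses the float_data Series from above. The variable names mean_skipping_na, mean_with_na, and with_none are illustrative only and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

float_data = pd.Series([1.2, -3.5, np.nan, 0])

# Descriptive statistics skip NaN by default (skipna=True),
# so the mean is taken over the three non-missing values.
mean_skipping_na = float_data.mean()

# With skipna=False the missing value propagates and the result is NaN.
mean_with_na = float_data.mean(skipna=False)

# In a numeric Series, None is converted to NaN when the Series is built.
with_none = pd.Series([1.0, None, 3.0])
print(mean_skipping_na, mean_with_na, with_none.dtype)
```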

***Method	Description
dropna	Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna	Fill in missing data with some value or using an interpolation method such as "ffill" or "bfill".
isna	Return Boolean values indicating which values are missing/NA.
notna	Negation of isna, returns True for non-NA values and False for NA values.***
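The methods in the table can be tried on a small Series. This is a minimal sketch: the Series s is made-up data, and it uses the ffill and bfill methods for forward and backward filling, which recent pandas versions prefer over passing a method name to fillna.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# notna is the negation of isna; indexing with it keeps non-missing values,
# which gives the same result as dropna.
kept = s[s.notna()]

# Forward fill: carry the last valid value forward.
forward = s.ffill()

# Backward fill: pull the next valid value backward.
backward = s.bfill()

print(kept.tolist(), forward.tolist(), backward.tolist())
```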

Filtering out the missing data
With the dropna() method we can drop the NA values and keep only the non-missing data.


In [13]:
data=pd.Series([1,np.nan,3.5,np.nan,7])

In [14]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [15]:
data1 = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                       [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

In [16]:
data1.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


#Passing how="all" drops only the rows in which every value is NA; rows with at least one non-NA value are kept.

In [17]:
data1.dropna(how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [20]:
data1.dropna(axis="columns",how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0
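In the call above no column is entirely NA, so nothing is dropped. The sketch below adds an all-NA column to make the effect visible, and also shows the thresh argument, which is how dropna expresses a threshold on the number of non-NA values. The names with_empty_col, no_empty_cols, and at_least_two are illustrative.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Add a column that is entirely NA, then drop columns that are all NA.
with_empty_col = data1.copy()
with_empty_col[4] = np.nan
no_empty_cols = with_empty_col.dropna(axis="columns", how="all")

# thresh=2 keeps only the rows with at least two non-NA values.
at_least_two = data1.dropna(thresh=2)

print(no_empty_cols.columns.tolist())
print(at_least_two.index.tolist())
```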


In [21]:
data1.fillna(0)

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,0.0,0.0
2,0.0,0.0,0.0
3,0.0,6.5,3.0


In [22]:
data1.fillna({0:"not included",1:0})

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,0.0,
2,not included,0.0,
3,not included,6.5,3.0
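Besides a constant or a per-column dictionary, fillna also accepts a Series whose index matches the column labels. A common use, sketched here under the assumption that filling with the column mean is appropriate for the data, is the following. Note that fillna returns a new object and leaves data1 unchanged.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# data1.mean() is a Series indexed by column label; each column's NA
# values are filled with that column's mean (computed ignoring NA).
filled = data1.fillna(data1.mean())

print(filled)
```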


In [23]:
data2 = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                       "k2": [1, 1, 2, 3, 3, 4, 4]})

In [24]:
data2

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


#Here rows 5 and 6 are identical, so one of them is a duplicate that we can remove.
The DataFrame method duplicated returns a Boolean Series indicating whether each row is a duplicate (its column values are exactly equal to those in an earlier row) or not:


Relatedly, drop_duplicates returns a DataFrame containing only the rows for which duplicated returns False:



In [26]:
data2.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [27]:
data2.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


Both methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates based only on the "k1" column:

In [31]:
data2["v1"]=range(7)

In [32]:
data2

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


duplicated and drop_duplicates by default keep the first observed value combination. Passing keep="last" will keep the last one instead.
However, now that the v1 column has been added, no two rows are identical across all columns, so nothing is dropped below:


In [33]:
data2.drop_duplicates(keep="last")

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6
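To filter duplicates based on only some of the columns, as described above, drop_duplicates takes a subset argument. This is a minimal sketch using the same data2; the names by_k1 and last_of_pair are illustrative.

```python
import pandas as pd

data2 = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                      "k2": [1, 1, 2, 3, 3, 4, 4]})
data2["v1"] = range(7)

# Only "k1" is compared, so the first "one" and the first "two" survive.
by_k1 = data2.drop_duplicates(subset=["k1"])

# Compare on "k1" and "k2" and keep the last row of each combination,
# so row 6 is kept instead of row 5.
last_of_pair = data2.drop_duplicates(subset=["k1", "k2"], keep="last")

print(by_k1.index.tolist(), last_of_pair.index.tolist())
```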


Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the hypothetical data collected about various kinds of meat in the shop DataFrame created below.

***Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:

meat_to_animal = {
  "bacon": "pig",
  "pulled pork": "pig",
  "pastrami": "cow",
  "corned beef": "cow",
  "honey ham": "pig",
  "nova lox": "salmon"
}
The map method on a Series (also discussed in Ch 5.2.5: Function Application and Mapping) accepts a function or dictionary-like object containing a mapping to do the transformation of values:




***

In [35]:
shop = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon",
                                  "pastrami", "corned beef", "bacon",
                                  "pastrami", "honey ham", "nova lox"],
                      "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})



In [36]:
shop

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5
5,bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [37]:
meat_to_animal = {
  "bacon": "pig",
  "pulled pork": "pig",
  "pastrami": "cow",
  "corned beef": "cow",
  "honey ham": "pig",
  "nova lox": "salmon"
}

In [38]:
shop["animal"]=shop["food"].map(meat_to_animal)

In [39]:
shop

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


The `map` method on a pandas Series applies a function to each element, or looks up each 
element in a dictionary-like mapping, and returns a new Series with the results. When a 
dictionary is passed, any value without a matching key becomes NaN. It is similar to 
`apply()`, but `map` additionally accepts a dictionary or Series as the mapping.



Here, `map` takes each food item from the `shop["food"]` column (a Series) 
and looks up the corresponding animal in the `meat_to_animal` dictionary. 
The resulting Series is then assigned to the `shop["animal"]` column, so 
each row holds the animal for its food item.
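The difference between passing a dictionary and passing a function can be sketched with a small made-up Series. The food values, the shortened meat_to_animal dictionary, and the "unknown" fallback are assumptions for illustration and are not part of the shop data.

```python
import pandas as pd

food = pd.Series(["bacon", "Pastrami", "tofu"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# Dictionary lookup is exact: "Pastrami" (capitalized) and "tofu"
# have no matching key, so they become missing values.
direct = food.map(meat_to_animal)

# Normalizing the case first lets "Pastrami" match "pastrami".
normalized = food.str.lower().map(meat_to_animal)

# A function gives full control, for example a fallback for unknown keys.
via_function = food.map(lambda x: meat_to_animal.get(x.lower(), "unknown"))

print(direct.isna().tolist(), via_function.tolist())
```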




In [40]:
ser= pd.Series([1., -999., 2., -999., -1000., 3.])

In [41]:
ser

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

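Replacing values: suppose -999 and -1000 are placeholder values that need correcting. The replace method substitutes them with other values, such as NaN: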

In [42]:
ser.replace(-999,np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [44]:
ser.replace([-999,-1000],0)

0    1.0
1    0.0
2    2.0
3    0.0
4    0.0
5    3.0
dtype: float64

In [45]:
ser.replace([-999,-1000],[0,0.5])

0    1.0
1    0.0
2    2.0
3    0.0
4    0.5
5    3.0
dtype: float64

In [46]:
ser.replace({-999:np.nan,-1000:0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
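One point worth keeping in mind is that Series.replace matches whole values, while the separate str.replace accessor method substitutes within each string. The sketch below uses a made-up Series of strings to show the difference.

```python
import pandas as pd

s = pd.Series(["cat", "category", "dog"])

# replace matches entire values: only the element equal to "cat" changes.
whole = s.replace("cat", "feline")

# str.replace works inside each string, so "category" is affected too.
partial = s.str.replace("cat", "feline", regex=False)

print(whole.tolist(), partial.tolist())
```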

#Renaming the axis indexes


In [49]:
data3=pd.DataFrame(np.arange(12).reshape(3,4),index=["Ohio","Colorado","new york"], columns=["one","Two","Three","Four"])



In [50]:
data3

Unnamed: 0,one,Two,Three,Four
Ohio,0,1,2,3
Colorado,4,5,6,7
new york,8,9,10,11


In [58]:
def transform(x):
    return x[:4].upper()

x[:4]: This slices the first four characters from the input x. If x has fewer than four characters, it takes all the available characters.
.upper(): This converts the sliced string to uppercase.
For example:

If x is "hello", x[:4] gives "hell", and "hell".upper() returns "HELL".
If x is "hi", x[:4] gives "hi", and "hi".upper() returns "HI".

You can assign to the index attribute, modifying the DataFrame in place:

In [59]:
data3.index.map(transform)


Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [60]:
data3.index = data3.index.map(transform)

In [61]:
data3

Unnamed: 0,one,Two,Three,Four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [62]:
data3.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [65]:

data3.rename(index={"OHIO": "INDIANA"},
             columns={"Three": "peekaboo"})


Unnamed: 0,one,Two,peekaboo,Four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [75]:
data3.rename(index={"OHIO": "Pokhara", "COLO": "KTM", "NEW": "Arva"})


Unnamed: 0,one,Two,Three,Four
Pokhara,0,1,2,3
KTM,4,5,6,7
NEW,8,9,10,11



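The rename method in pandas is used to alter the index or column labels of a DataFrame. It accepts a function or a dictionary that maps old labels to new ones, and by default it returns a new DataFrame. Labels that are not found in the dictionary are left unchanged. That is why the third row keeps its label above: the actual label is "NEW " with a trailing space (see the Index output earlier), so the key "NEW" does not match it.

inplace=True: This modifies the original DataFrame data3 in place instead of returning a new one.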
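In the rename call above, the third label is not renamed. The Index output earlier shows the label as 'NEW ' with a trailing space, so the dictionary key "NEW" never matches. One possible fix, sketched here with the already transformed labels typed in by hand, is to strip whitespace from the labels first; the names renamed and cleaned are illustrative.

```python
import numpy as np
import pandas as pd

data3 = pd.DataFrame(np.arange(12).reshape(3, 4),
                     index=["OHIO", "COLO", "NEW "],
                     columns=["one", "Two", "Three", "Four"])

# The key "NEW" does not equal the label "NEW ", so nothing changes
# and no error is raised.
renamed = data3.rename(index={"NEW": "Arva"})

# Strip surrounding whitespace from every label, then rename.
cleaned = data3.rename(index=str.strip).rename(index={"NEW": "Arva"})

print(list(renamed.index), list(cleaned.index))
```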

In [74]:
data3

Unnamed: 0,one,Two,Three,Four
Pokhara,0,1,2,3
KTM,4,5,6,7
NEW,8,9,10,11


#Discretization and binning

In [76]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]


Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, use pandas.cut:

In [77]:
bins = [18, 25, 35, 60, 100]

In [78]:
age_categories = pd.cut(ages, bins)

In [79]:
age_categories

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
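The result of pandas.cut is a Categorical whose intervals are open on the left and closed on the right by default, so 25 falls in (18, 25]. The sketch below looks at the integer codes, counts the values per bin with custom labels, and flips the closed side with right=False. The group_names labels are made up for illustration.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

age_categories = pd.cut(ages, bins)

# codes gives the bin number of each age (0 is the first bin).
codes = age_categories.codes.tolist()

# labels replaces the interval notation with custom names (hypothetical).
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
labeled = pd.cut(ages, bins, labels=group_names)
counts = pd.Series(labeled).value_counts().to_dict()

# right=False closes the left side instead: 25 now falls in [25, 35).
left_closed = pd.cut(ages, bins, right=False)

print(codes)
print(counts)
```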

#Permuting (randomly reordering) a Series or the rows in a DataFrame is possible using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:


A permutation refers to an arrangement of all the members of a set into a sequence or order. In other words, it is a rearrangement of elements in a particular order. 

In [81]:
df=pd.DataFrame(np.arange(5*7).reshape(5,7))

In [82]:
df

Unnamed: 0,0,1,2,3,4,5,6
0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34


In [92]:
sampler=np.random.permutation(5)

In [93]:
sampler

array([4, 1, 0, 2, 3])

In [94]:
df.iloc[sampler]

Unnamed: 0,0,1,2,3,4,5,6
4,28,29,30,31,32,33,34
1,7,8,9,10,11,12,13
0,0,1,2,3,4,5,6
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27


In [95]:
df.take(sampler)

Unnamed: 0,0,1,2,3,4,5,6
4,28,29,30,31,32,33,34
1,7,8,9,10,11,12,13
0,0,1,2,3,4,5,6
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
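Because permutation is random, the order shown above changes from run to run. A reproducible variant can be sketched with a seeded NumPy generator, and the sample method selects a random subset of rows without building the permutation by hand. The seed values here are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 7).reshape(5, 7))

# A seeded generator makes the shuffle repeatable.
rng = np.random.default_rng(12345)
sampler = rng.permutation(len(df))
shuffled = df.iloc[sampler]

# sample draws rows at random without replacement; random_state fixes the draw.
subset = df.sample(n=3, random_state=0)

print(sampler)
print(subset.index.tolist())
```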


In [96]:
df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})

In [97]:
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [100]:
pd.get_dummies(df["key"],dtype=float)

Unnamed: 0,a,b,c
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,1.0,0.0,0.0
3,0.0,0.0,1.0
4,1.0,0.0,0.0
5,0.0,1.0,0.0
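The get_dummies call above turns each distinct value of the "key" column into its own indicator column. A common follow-up, sketched here, is to add a prefix to the new column names and join them back to the remaining data. The prefix "key" and the name df_with_dummy are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})

# prefix="key" names the indicator columns key_a, key_b, key_c.
dummies = pd.get_dummies(df["key"], prefix="key", dtype=float)

# Join the indicator columns to the rest of the data by index.
df_with_dummy = df[["data1"]].join(dummies)

print(df_with_dummy)
```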


#String manipulation

In [101]:
val="a,b,   guido"

In [102]:
val.split(",")

['a', 'b', '   guido']

In [103]:
pieces=[x.strip() for x in val.split(",")]

In [104]:
pieces

['a', 'b', 'guido']

In [105]:
"::".join(pieces)

'a::b::guido'

In [106]:
"guido" in val

True

In [107]:
val.index(",")

1

In [108]:
val.replace(",","::")

'a::b::   guido'

***Method	Description
count	Return the number of nonoverlapping occurrences of substring in the string
endswith	Return True if string ends with suffix
startswith	Return True if string starts with prefix
join	Use string as delimiter for concatenating a sequence of other strings
index	Return starting index of the first occurrence of passed substring if found in the string; otherwise, raises ValueError if not found
find	Return position of first character of first occurrence of substring in the string; like index, but returns –1 if not found
rfind	Return position of first character of last occurrence of substring in the string; returns –1 if not found
replace	Replace occurrences of string with another string
strip, rstrip, lstrip	Trim whitespace, including newlines on both sides, on the right side, or on the left side, respectively
split	Break string into list of substrings using passed delimiter
lower	Convert alphabet characters to lowercase
upper	Convert alphabet characters to uppercase
casefold	Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form
ljust, rjust	Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width***
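A few of the methods in the table can be tried on the same val string used earlier. This is a small sketch using only Python built-in string methods; the variable names are illustrative.

```python
val = "a,b,   guido"

comma_count = val.count(",")   # number of commas in the string
last_comma = val.rfind(",")    # position of the last comma
missing_pos = val.find(":")    # find returns -1 when nothing is found

# index raises ValueError instead of returning -1.
try:
    val.index(":")
    index_raised = False
except ValueError:
    index_raised = True

# ljust pads on the right with the fill character up to the given width.
padded = "guido".ljust(8, ".")

# casefold is a more aggressive lower() intended for comparisons.
same = "Straße".casefold() == "STRASSE".casefold()

print(comma_count, last_comma, missing_pos, index_raised, padded, same)
```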