#### Ways to query data in Pandas
1. df.loc: query by row and column labels
2. df.iloc: query by row and column integer positions
3. df.where
4. df.query
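
To make the four methods concrete, here is a minimal sketch on a tiny invented frame. The name `toy` and its values are assumptions for illustration only and are not part of the dataset used in this notebook.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {"name": ["A", "B", "C"], "revenues": [300.0, 200.0, 100.0]},
    index=["r1", "r2", "r3"],
)

by_label = toy.loc["r2", "revenues"]       # label-based lookup
by_position = toy.iloc[1, 1]               # position-based lookup (second row, second column)
# Series.where keeps the original length; values failing the test become NaN
masked = toy["revenues"].where(toy["revenues"] > 150)
# DataFrame.query keeps only the rows that pass the test
queried = toy.query("revenues > 150")
```

Note the difference between the last two: `where` preserves the shape and blanks out non-matching values, while `query` drops the non-matching rows.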

.loc can both read data and overwrite it in place, so it is the recommended choice.
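
A short sketch of the "overwrite" side of `.loc`, again on invented data (the frame `toy` and its numbers are assumptions, not values from the dataset):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"revenues": [300.0, 200.0, 100.0]}, index=["r1", "r2", "r3"])

# Overwrite a single cell, addressed by row label and column label
toy.loc["r2", "revenues"] = 250.0

# Overwrite every row matched by a condition
toy.loc[toy["revenues"] < 150, "revenues"] = 0.0
```

The same selector syntax works for reading and for assignment, which is why one method covers both jobs.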

#### Ways to query data with df.loc
1. Query with a single label
2. Query with a list of labels
3. Query a range with a label slice
4. Query with a conditional expression
5. Query with a function

#### Notes
+ Every method above works on rows and on columns alike
+ Watch how the result drops in dimension: DataFrame > Series > scalar value
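
The dimension drop can be checked directly. This is a sketch on an invented two-row frame (`toy` is a hypothetical name): a list selector keeps a dimension, a single label removes one.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {"revenues": [300.0, 200.0], "profits": [30.0, 20.0]},
    index=["r1", "r2"],
)

frame = toy.loc[["r1", "r2"], ["revenues", "profits"]]  # list + list    -> DataFrame
series = toy.loc["r1", ["revenues", "profits"]]         # label + list   -> Series
scalar = toy.loc["r1", "revenues"]                      # label + label  -> single value
still_frame = toy.loc[["r1"], ["revenues"]]             # one-item lists -> 1x1 DataFrame
```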

#### Loading the data
The data is the 2022 list of the world's top 1000 companies

In [2]:
import pandas as pd

In [3]:
df=pd.read_csv("C:/Users/pxpxz_ct9p1p3/Downloads/Fortune_1000_Companies_by_Revenue.csv")

In [4]:
df.head()

Unnamed: 0,rank,name,revenues,revenue_percent_change,profits,profits_percent_change,assets,market_value,change_in_rank,employees
0,1,Walmart,"$572,754",2.40%,"$13,673",1.20%,"$244,860","$409,795",-,2300000
1,2,Amazon,"$469,822",21.70%,"$33,364",56.40%,"$420,549","$1,658,807.30",-,1608000
2,3,Apple,"$365,817",33.30%,"$94,680",64.90%,"$351,002","$2,849,537.60",-,154000
3,4,CVS Health,"$292,111",8.70%,"$7,910",10.20%,"$232,999","$132,839.20",-,258000
4,5,UnitedHealth Group,"$287,597",11.80%,"$17,285",12.20%,"$212,206","$479,830.30",-,350000


In [5]:
df['rank']

0          1
1          2
2          3
3          4
4          5
       ...  
995      996
996      997
997      998
998      999
999    1,000
Name: rank, Length: 1000, dtype: object

In [6]:
df.dtypes

rank                      object
name                      object
revenues                  object
revenue_percent_change    object
profits                   object
profits_percent_change    object
assets                    object
market_value              object
change_in_rank            object
employees                 object
dtype: object

In [7]:
# Convert every column's dtype to string
df = df.astype('string')

In [8]:
df.dtypes

rank                      string
name                      string
revenues                  string
revenue_percent_change    string
profits                   string
profits_percent_change    string
assets                    string
market_value              string
change_in_rank            string
employees                 string
dtype: object

In [9]:
# Use rank as the index so rows can be selected by rank
df.set_index('rank', inplace=True)

In [10]:
df.index

Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10',
       ...
       '991', '992', '993', '994', '995', '996', '997', '998', '999', '1,000'],
      dtype='object', name='rank', length=1000)

In [11]:
df.head()

rank,name,revenues,revenue_percent_change,profits,profits_percent_change,assets,market_value,change_in_rank,employees
1,Walmart,"$572,754",2.40%,"$13,673",1.20%,"$244,860","$409,795",-,2300000
2,Amazon,"$469,822",21.70%,"$33,364",56.40%,"$420,549","$1,658,807.30",-,1608000
3,Apple,"$365,817",33.30%,"$94,680",64.90%,"$351,002","$2,849,537.60",-,154000
4,CVS Health,"$292,111",8.70%,"$7,910",10.20%,"$232,999","$132,839.20",-,258000
5,UnitedHealth Group,"$287,597",11.80%,"$17,285",12.20%,"$212,206","$479,830.30",-,350000


#### 1. Query with a single label
Either the row or the column selector can be a single value, which gives an exact match

In [12]:
# Returns a single value
df.loc['23', 'revenues']

'$133,613 '

In [13]:
# Returns a Series
df.loc['23', ['revenues', 'profits']]

revenues    $133,613 
profits      $22,065 
Name: 23, dtype: string

#### 2. Query with a list of labels

In [14]:
# Returns a Series
df.loc[['2','3','5'], 'profits']

rank
2    $33,364 
3    $94,680 
5    $17,285 
Name: profits, dtype: string

In [15]:
# Returns a DataFrame
df.loc[['2','3','5'], ['revenues', 'profits']]

rank,revenues,profits
2,"$469,822","$33,364"
3,"$365,817","$94,680"
5,"$287,597","$17,285"


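#### 3. Query a range with a label slice
Both ends of a .loc slice are included. The index here holds strings, so '3':'5' selects by label position, not by numeric comparison.

In [16]:
# Slice the row index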
df.loc['3':'5', 'profits']

rank
3    $94,680 
4     $7,910 
5    $17,285 
Name: profits, dtype: string

In [17]:
# Slice the column index
df.loc['10', 'revenues':'profits']

revenues                  $213,988.80 
revenue_percent_change          12.70%
profits                     $1,539.90 
Name: 10, dtype: string

In [18]:
# Slice both rows and columns
df.loc['3':'5', 'revenues':'profits']

rank,revenues,revenue_percent_change,profits
3,"$365,817",33.30%,"$94,680"
4,"$292,111",8.70%,"$7,910"
5,"$287,597",11.80%,"$17,285"
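
A label slice in `.loc` includes its end label, unlike a positional slice in `.iloc`. The sketch below uses an invented frame with string labels (`toy` is a hypothetical name) to show the difference; with a unique string index, the slice runs from the position of the first label to the position of the last one.

```python
import pandas as pd

# Toy data with string labels, invented for illustration only
toy = pd.DataFrame({"profits": [1, 2, 3, 4]}, index=["3", "4", "5", "10"])

# .loc slice: both end labels are included
by_label = toy.loc["3":"5", "profits"]

# .iloc slice: the end position is excluded, as in ordinary Python slicing
by_position = toy.iloc[0:2, 0]
```

If genuine numeric range selection is needed, converting the index to integers first is the safer route, since string labels do not compare numerically.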


#### 4. Query with a conditional expression
The boolean list must be exactly as long as the number of rows (or columns) it filters
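
As a small illustration of the length rule, the sketch below filters an invented 3x2 frame (`toy`, `row_mask`, and `col_mask` are hypothetical names) with plain Python lists of booleans and then tries a mask that is too short.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

row_mask = [True, False, True]   # one entry per row
col_mask = [False, True]         # one entry per column

rows = toy.loc[row_mask, :]
cols = toy.loc[:, col_mask]

# A mask whose length does not match the axis is rejected by pandas
try:
    toy.loc[[True, False], :]
    wrong_length_rejected = False
except (IndexError, ValueError):
    wrong_length_rejected = True
```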

##### Simple condition: companies with revenues below 3000

In [19]:
df.dtypes

name                      string
revenues                  string
revenue_percent_change    string
profits                   string
profits_percent_change    string
assets                    string
market_value              string
change_in_rank            string
employees                 string
dtype: object

In [20]:
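# Tip: check the dtypes first. Before a string column can be converted to numbers,
# remove 1. the '$' sign and 2. any thousands separators such as the ',' in '34,556'.
# regex=False treats '$' as a literal character instead of the regex end-of-string anchor.
df['revenues'] = df['revenues'].str.replace('$', '', regex=False)
df['revenues'] = df['revenues'].str.replace(',', '', regex=False)

df['revenue_percent_change'] = df['revenue_percent_change'].str.replace('%', '', regex=False)
# Blank out only a lone '-' (missing value); a plain replace('-', '') would also strip minus signs
df['revenue_percent_change'] = df['revenue_percent_change'].str.replace(r'^\s*-\s*$', '', regex=True)
print(df.dtypes)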

name                      string
revenues                  string
revenue_percent_change    string
profits                   string
profits_percent_change    string
assets                    string
market_value              string
change_in_rank            string
employees                 string
dtype: object




In [21]:
df.head(1)

rank,name,revenues,revenue_percent_change,profits,profits_percent_change,assets,market_value,change_in_rank,employees
1,Walmart,572754,2.4,"$13,673",1.20%,"$244,860","$409,795",-,2300000


In [22]:
# Convert the revenues column to a numeric dtype
df['revenues'] = pd.to_numeric(df['revenues'])
# Convert the revenue_percent_change column to a numeric dtype
df['revenue_percent_change'] = pd.to_numeric(df['revenue_percent_change'])
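
The cleaning steps can be bundled into one reusable function. The following is only a sketch: `to_number` is a hypothetical helper, the sample strings are modelled on the formats seen in this dataset, and it assumes that anything unparseable (such as a lone '-') should become NaN.

```python
import pandas as pd

def to_number(col: pd.Series) -> pd.Series:
    """Hypothetical helper: strip whitespace, '$', ',' and '%', then parse as numbers."""
    cleaned = (
        col.str.strip()
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .str.replace("%", "", regex=False)
    )
    # errors="coerce" turns anything that is still not a number into NaN
    return pd.to_numeric(cleaned, errors="coerce")

# Sample strings invented for illustration
sample = pd.Series(["$572,754 ", "$1,658,807.30", "-", "-6.50%"])
numbers = to_number(sample)
```

Because the minus sign is never removed, a value such as '-6.50%' keeps its sign, while a lone '-' ends up as NaN.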

In [23]:
df.loc[df['revenues']<3000, :] # tip: ':' selects every column

rank,name,revenues,revenue_percent_change,profits,profits_percent_change,assets,market_value,change_in_rank,employees
829,Verisk Analytics,2998.6,7.7,$666.20,-6.50%,"$7,808.10","$34,616.20",-41,9367
830,Spectrum Brands Holdings,2998.1,24.4,$189.60,93.90%,"$5,340.40","$3,617.60",-209,12100
831,Euronet Worldwide,2995.4,20.7,$70.70,-,"$4,744.30","$6,658.20",16,8800
832,TEGNA,2991.1,1.8,$477,-1.20%,"$6,917.60","$4,962.60",-76,6200
833,Vontier,2990.7,10.6,$413,20.80%,"$4,349.80","$4,087.80",-32,8500
...,...,...,...,...,...,...,...,...,...
996,Vizio Holding,2124.0,4.0,($39.40),-138.40%,$935.80,"$1,705.10",-,800
997,1-800-Flowers.com,2122.2,42.5,$118.70,101.10%,"$1,076.70",$830,-,4800
998,Cowen,2112.8,30.2,$295.60,36.60%,"$8,748.80",$744.10,-,1534
999,Ashland Global Holdings,2111.0,11.2,$220,-,"$6,612","$5,601.90",-130,4100


##### Compound conditions
Combine conditions with the & operator, and wrap every individual condition in parentheses

In [24]:
df.loc[(df['revenues']<2500) & (df['revenue_percent_change']>10), :]

rank,name,revenues,revenue_percent_change,profits,profits_percent_change,assets,market_value,change_in_rank,employees
922,MYR Group,2498.3,11.2,$85,44.70%,"$1,121.10","$1,594.50",-16,7600
924,Liberty Energy,2470.8,155.8,($179.20),-,"$2,040.70","$2,757.90",-,3601
926,AdaptHealth,2465.1,133.4,$156.20,-,"$5,250.50","$2,146.60",-,10700
929,DexCom,2448.5,27.1,$154.70,-68.70%,"$4,863.60","$49,824.50",47,6650
930,Rollins,2424.3,12.2,$350.70,34.50%,"$1,980.90","$17,260.70",-9,16482
931,Genesco,2422.1,35.6,$114.90,-,"$1,562.10",$868.70,-,11700
932,Bruker,2417.9,21.7,$277.10,75.60%,"$3,650","$9,694.70",26,7765
933,Joann,2417.6,12.5,$56.70,-73.30%,"$2,362.20",$463.20,-142,13530
934,Wolverine World Wide,2414.9,34.8,$68.60,-,"$2,586.40","$1,836.10",-,4400
936,Affiliated Managers Group,2412.4,19.0,$565.70,179.80%,"$8,876.40","$5,606.80",15,4050


In [25]:
# Inspect the result of the condition: a boolean Series
df['revenues']<2500

rank
1        False
2        False
3        False
4        False
5        False
         ...  
996       True
997       True
998       True
999       True
1,000     True
Name: revenues, Length: 1000, dtype: bool

In [26]:
# Inspect the combined condition: again a boolean Series
(df['revenues']<2500) & (df['revenue_percent_change']>10)

rank
1        False
2        False
3        False
4        False
5        False
         ...  
996      False
997       True
998       True
999       True
1,000     True
Length: 1000, dtype: bool
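
Since each condition is just a boolean Series, masks can be stored in variables and combined with `&` (and), `|` (or), and `~` (not). The sketch below uses invented numbers; `toy`, `small`, and `fast` are hypothetical names.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "revenues": [2600.0, 2400.0, 2300.0, 2100.0],
        "revenue_percent_change": [5.0, 12.0, 8.0, 30.0],
    },
    index=["901", "902", "903", "904"],
)

small = toy["revenues"] < 2500
fast = toy["revenue_percent_change"] > 10

both = toy.loc[small & fast, :]     # rows where both masks are True
either = toy.loc[small | fast, :]   # rows where at least one mask is True
not_small = toy.loc[~small, :]      # rows where the first mask is False
# Series.between includes both bounds by default
in_band = toy.loc[toy["revenues"].between(2200, 2500), :]
```

Naming the masks first avoids most parenthesis mistakes, because each comparison is evaluated on its own line.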

#### 5. Query with a function

In [27]:
# Pass a lambda expression directly
df.loc[lambda df : (df['revenues']<=3000) & (df['revenue_percent_change']>=10), :]

rank,name,revenues,revenue_percent_change,profits,profits_percent_change,assets,market_value,change_in_rank,employees
830,Spectrum Brands Holdings,2998.1,24.4,$189.60,93.90%,"$5,340.40","$3,617.60",-209,12100
831,Euronet Worldwide,2995.4,20.7,$70.70,-,"$4,744.30","$6,658.20",16,8800
833,Vontier,2990.7,10.6,$413,20.80%,"$4,349.80","$4,087.80",-32,8500
834,Cadence Design Systems,2988.2,11.4,$696,17.80%,"$4,386.30","$45,781.70",-29,9298
835,Incyte,2986.3,12.0,$948.60,-,"$4,933.40","$17,577.60",-28,2094
...,...,...,...,...,...,...,...,...,...
994,Genesis Energy,2125.5,16.5,($165.10),-,"$5,905.80","$1,435.40",-,1898
997,1-800-Flowers.com,2122.2,42.5,$118.70,101.10%,"$1,076.70",$830,-,4800
998,Cowen,2112.8,30.2,$295.60,36.60%,"$8,748.80",$744.10,-,1534
999,Ashland Global Holdings,2111.0,11.2,$220,-,"$6,612","$5,601.90",-130,4100


In [28]:
df.dtypes

name                       string
revenues                  float64
revenue_percent_change    float64
profits                    string
profits_percent_change     string
assets                     string
market_value               string
change_in_rank             string
employees                  string
dtype: object

In [37]:
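# Write your own function
def query_my_data(df):
    # Parenthesize the comparison: & binds tighter than >=, so without the parentheses
    # Python evaluates (mask & column) >= 20 and raises
    # TypeError: unsupported operand type(s) for &: 'bool' and 'float'
    return df.index.str.startswith('5') & (df['revenue_percent_change'] >= 20)

df.loc[query_my_data, :] # the essence of functional programming: a function can be passed around like a variable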

In [41]:
df.index.str.startswith('5')

array([False, False, False, False,  True, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

In [39]:
df['revenue_percent_change']>=20

rank
1        False
2         True
3         True
4        False
5        False
         ...  
996      False
997       True
998       True
999      False
1,000     True
Name: revenue_percent_change, Length: 1000, dtype: bool
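
Python's `&` binds more tightly than `>=`, so a mask written as `a & b >= 20` is parsed as `(a & b) >= 20`. A minimal sketch of a selector function with the comparison parenthesized, on invented data (`toy` and `starts_with_5_and_fast` are hypothetical names):

```python
import pandas as pd

# Toy data with string labels, invented for illustration only
toy = pd.DataFrame(
    {"revenue_percent_change": [25.0, 5.0, 40.0]},
    index=["5", "50", "61"],
)

def starts_with_5_and_fast(frame):
    # Each comparison gets its own parentheses before the masks are combined
    return frame.index.str.startswith("5") & (frame["revenue_percent_change"] >= 20)

# .loc calls the function with the frame and uses the returned boolean mask
picked = toy.loc[starts_with_5_and_fast, :]
```

Only label '5' passes both tests here: '50' starts with '5' but grows by 5.0, and '61' grows fast but does not start with '5'.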

#### Summary
+ Inspect the CSV file carefully for stray spaces, such as the trailing space in '$133,613 '.
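
One way to run that check in code, sketched on invented data (`raw`, `padded`, and `clean` are hypothetical names, and every column is assumed to hold strings):

```python
import pandas as pd

# Toy data with stray spaces, invented for illustration only
raw = pd.DataFrame({"revenues": ["$133,613 ", " $22,065"], "name": ["X ", "Y"]})

# Count, per column, the cells whose length changes when whitespace is stripped
padded = raw.apply(lambda col: col.str.len() != col.str.strip().str.len()).sum()

# Strip leading and trailing whitespace from every string column
clean = raw.apply(lambda col: col.str.strip())
```

Running the count before and after cleaning gives a quick confirmation that no padded cells remain.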